So we were recently given OpenCL, and nVidia was so nice as to provide both a 32bit and a 64bit version.
Since GPU programming is cool I made a nice wrapper for it (currently unfinished but (and this is the best part) workable) in C#, and did a couple of tests with it.
I found out that (on at least my GPU) there is a smallish limit on the output size, which is a shame, but we could of course come up with some clever tiling scheme.
A 1024x1024 mandelbrot render with 250 max iterations takes approximately 23 milliseconds - depending of course on what part you are looking at and your GPU (GTX260 in my case) etc etc, but 23 milliseconds is so incredibly fast that none of the details really matter here. Copying the result back to normal ram took an other 20 milliseconds which is pretty much instantaneously.
These kinds of speed won't be beaten by your average CPU.
Compiling the OpenCL code takes a while, but this only needs to be done once per session.
The wrapper is written in pure C# and works in both 32bit and 64bit modes (although it currently only supports nvidia GPU's and requires their OpenCL capable drivers), and will be released into the public domain as soon as I feel comfortable with it (ironed out some bugs and added some more functionality etc) for now you can link to the binary in your project (but you really shouldn't as it is not finished) and edit the source as long as you don't make it available (but honestly, what's going to happen if you do it anyway?)
The reason I'm posting here is that this wrapper may be interesting to plugin writers, at least until Paint.NET exposes something like DirectCompute/OpenCL itself (any plans, Rick?)
In the mean time you can all get a copy of the temporary wrapper + test project (mandelbrot) here (mediafire) (too big to upload it to the forum)
And yes, most of the error codes are translated to throw new Exception() which sucks, it is not finished.
ps: sorry if this is the wrong forum
small status update: it looks to me like there is a bug in clEnqueueNDRangeKernel when called with a dimension of 2 (haven't tested 3), it throws a division by zero exception seemingly for no reason (I gave it reasonable arguments), so for now I'm cheating and using code like
int id = get_global_id(0);
int ix = id % pxwidth;
int iy = id / pxwidth;
Which works fine but probably costs some extra time (well that's an extra % and / you normally wouldn't have, and since their argument is not a constant they can not be free)
It looks like a hack and that's exactly what it is, but until I find out where that odd division by zero is coming from you/I won't be able to use multidimensional kernels.