GPU Motion Blur effect using DirectCompute

Edit: This effect has now been published here. If you just want to use it get it from there.

GPU based effects and comparisons to CPU

This started out as a bit of a personal research project but I wanted to share the code and get some opinions.

There has been a lot of talk about using the GPU for general-purpose computation (GPGPU), so I wanted to see if Paint.NET effects could benefit from this technique. Using DirectCompute and compute shaders, could they outperform a CPU by enough of a margin to make dealing with the extra dependencies worthwhile? Short answer: yes, yes they can.

I ran a few tests on my middle-aged computer with the standard Motion Blur effect to confirm. The GPU version produces the same image (slight color variations due to rounding differences).

Intel Core 2 Duo E6400



Test image: 960 x 1280 photo 72dpi

Effect Settings: Motion Blur, Direction = 25.00, Centered = ticked

Blur Distance   CPU (Approx Times)   GPU      Speed Increase (Approx)
10              2400ms               311ms      671%
50              10600ms              348ms    2,945%
100             20100ms              398ms    4,950%
200             38000ms              498ms    7,530%

It is interesting that even with the overhead of copying the entire image across to the video card, the smallest computation is still notably faster.

Let’s up the image size a bit.

Test image: Resized 400%, 3840 x 5120

Blur Distance	CPU (Approx Times)	GPU	Speed Increase (Approx)
200	        10min 42.7s	        5567ms	11,444%


Test image: Resized 500%, 4800 x 6400

Blur Distance	CPU (Approx Times)	GPU	Speed Increase (Approx)
200	        16min 47.6s	        8281ms	12,067%

Well, that’s pretty impressive; it seems the GPU loves large data sets. Obviously this is a relative comparison, and if I had a quad core the difference would be about half, but even that is pretty good. As I am a gamer, I aimed for the dual core and the 8800 to be fairly balanced when I bought them, so that one wouldn’t be a bottleneck for the other. So I think this is a fair comparison.

Anyways, enough talk. Time for you to try it. For now this is a manual install. Here is what you need:


  • Windows 7 or Windows Vista with the DirectX 11 platform update (x86, x64).
  • SlimDX Runtime (February 2010)
  • Latest video drivers. DirectCompute support hasn’t been around for long, so you will need to update your video drivers to get it. Download GPU-Z and confirm that the DirectCompute checkbox is ticked. If it isn’t, you either have an unsupported video card or don’t have the latest drivers. The GPU effect will fall back to the reference driver (software rendering) if an unsupported device is found, which is incredibly slow.

I haven’t got an AMD/ATI card to try it out so I would be interested to hear if all is well on those cards plus how a newer NVIDIA card performs.

Extract this zip file into the Paint.net effects folder.


If you want to see the render time, drop this config file into the effects folder along with the effect dll. It will show a message box when the render of a full image selection is complete. Remove the config file when you are done.


Known issues: Currently getting an Out of Memory exception well before the available video card memory is used. Haven’t investigated this one yet. I know I need enough memory for the image and the output buffer, but it fails well short of that.

Edited by Bruce Bowyer-Smyth
Link to comment
Share on other sites

Dev Notes

Due to the newness of DirectCompute there are very few resources on the web. Most of these are for Darth C++, as you would expect, with just a couple showing how to use it from .NET. So hopefully this example will help people out, although I have no prior experience with DirectX so it may not follow “best practices” yet.

Development Prerequisites

Lessons Learned

  1. Having an external rendering framework doesn’t fit exactly into the existing Paint.NET effect model, so a couple of things need to be done if you are not writing a CPU effect. The first is to set EffectFlag.SingleThreaded in the constructor, which essentially says “I want to manage threading myself (on the GPU), not be CPU threaded”. The second is to set up your framework and anything needed across render calls in the OnSetRenderInfo method (is there a better way to do this?).
  2. HLSL (High Level Shader Language) constant buffers must be multiples of 16 bytes in size. You can either pack variables or add padding; see the Constants struct. Constants can be used to pass your configurable effect parameters into the shader.
  3. Timeout Detection and Recovery (TDR) is a Windows Vista/7 feature that prevents device drivers from locking up your system if they freeze. If a display driver doesn’t respond within 2 seconds it is restarted, with a message like “Display driver has stopped responding and has successfully recovered”. I initially thought the way PDN slices images into multiple render calls would be the Achilles heel of this solution, but it actually turned out to be its saviour, keeping each batch well below 2 seconds. Though don’t be surprised if you hit this problem before you get to make improvements to your code. TDR can be disabled through the registry, but it is not advisable to do so.
  4. SlimDX is a thin wrapper over DirectX 11 and many other Windows technologies. Just about every object it creates in this solution is unmanaged, so they all need to be tracked and disposed of in a timely manner.
  5. .NET types map pretty well to HLSL, but HLSL has a limited type set; mainly floats and ints are used. There is no byte type, so I was originally converting ColorBgra (a struct of 4 bytes) into a float4 (a struct of 4 floats), but the memory use was too large. I am now packing the 4 bytes into 1 int. I didn’t measure the speed beforehand, but it actually seems a little quicker, as there is a lot less data to copy and retrieve even with the pack/unpack overhead.
  6. Compute shader resources are all about buffers and views. Normally you create a buffer with data you want to pass and then create a view of that buffer.
  7. Debugging is difficult. As with most new tech, the tools to build come first and the tools to debug are refined later. Both AMD and NVIDIA are producing their own tools for this purpose. I have signed up to the NVIDIA Parallel Nsight beta, which is an add-in to Visual Studio. I just downloaded it so haven’t had a chance to use it yet, but it should be a lot better than what I was doing before, which was setting pixels to certain colors based on a condition I wanted to check.
  8. HLSL is compiled with fxc.exe, which comes with the DirectX SDK. See the compile.cmd file for the syntax. You can also compile at runtime from the .hlsl file.
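To make lesson 2 concrete, here is a minimal sketch of what a 16-byte-aligned constants struct might look like. The field names are hypothetical, not the actual Constants struct from this plugin; the point is only the size rule:

```csharp
using System.Runtime.InteropServices;

// Hypothetical effect parameters passed to the shader as a constant buffer.
// Three ints + one float = 16 bytes, an exact multiple of 16, so no padding
// is needed. If you only had three fields (12 bytes) you would add a dummy
// int to round the size up to 16.
[StructLayout(LayoutKind.Sequential, Pack = 4)]
struct Constants
{
    public int ImageWidth;
    public int ImageHeight;
    public int BlurDistance;
    public float Angle;
}
```

If you add a fifth field, the struct grows to 20 bytes and you would pad it out to 32 before uploading it.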
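And a sketch of the pack/unpack idea from lesson 5: folding four BGRA bytes into one int before upload and splitting them apart again after readback. These are standalone illustration helpers, not the plugin's actual code:

```csharp
static class BgraPacking
{
    // Pack blue, green, red, alpha bytes into one 32-bit int (blue in the low byte).
    public static int Pack(byte b, byte g, byte r, byte a)
    {
        return b | (g << 8) | (r << 16) | (a << 24);
    }

    // Unpack the four channels again after reading the result buffer back.
    // Casting to byte keeps only the low 8 bits, so sign extension from the
    // shift doesn't matter.
    public static (byte B, byte G, byte R, byte A) Unpack(int packed)
    {
        return ((byte)packed,
                (byte)(packed >> 8),
                (byte)(packed >> 16),
                (byte)(packed >> 24));
    }
}
```

The HLSL side would do the mirror-image shifts and masks to pull the channels out of the uint it receives.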


Feel free to use this code to create your own effect if you want. Just remember to set the build action of your .fx file to “Embedded Resource”.

Interested to hear of any improvements or suggestions from those in the know.


There is no byte so I was originally converting ColorBgra (struct of 4 bytes) into a float4 (struct of 4 floats) but the memory use was too large. I am now packing the 4 bytes into 1 int. Didn’t measure the speed beforehand but it actually seems a little quicker as there is a lot less information to copy and retrieve even with the pack/unpack overhead.

You shouldn't need to pack it yourself. It's already available as an Int32 union via ColorBgra.Bgra. Or maybe it's UInt32. Don't remember.



Sweet. On my system* it runs at the same performance no matter what setting I choose for "distance", during which PaintDotNet.exe shows only a few % of CPU usage.

As for an optimization hint, you can treat the OnRender() call as "this region must be finished rendering by the time you return, after which you can't write to it anymore." Contrast this with "you can only render to this region when I hand it to you." In other words, you are allowed to render to a region at any time between OnSetRenderInfo() and the completion of the OnRender() call that is told to render that region.

I believe Ed is using a trick in his Fast Blur such that he queues up and begins all rendering in OnSetRenderInfo() and then each OnRender() call simply waits for that region to be finished before it returns. (He is doing his own background/worker thread management.)
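A minimal sketch of that queue-and-wait pattern, with hypothetical names (this is not Ed's actual code): kick every region off up front, then have each per-region call simply block on its own completion:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical per-region scheduler: queue all regions' work from something
// like OnSetRenderInfo(), then let each OnRender(regionId) call wait only
// for its own region to finish.
class RegionScheduler
{
    private readonly Dictionary<int, Task> pending = new Dictionary<int, Task>();

    // Called once, up front: start background work for every region.
    public void BeginRenderAll(IEnumerable<int> regionIds, Action<int> renderRegion)
    {
        foreach (int id in regionIds)
        {
            int captured = id; // avoid capturing the loop variable
            pending[id] = Task.Run(() => renderRegion(captured));
        }
    }

    // Called from each per-region render; returns only when that region is done.
    public void WaitFor(int regionId)
    {
        pending[regionId].Wait();
    }
}
```

The effect framework's slicing then just becomes a synchronization point rather than the unit of work.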

* Core i7 980x 4.0GHz with 12GB RAM, GeForce GTX 260 Core 216 with 896MB RAM

The Paint.NET Blog: https://blog.getpaint.net/

Donations are always appreciated! https://www.getpaint.net/donate.html



I'll try this out and post the results!

Edit 1

Hey, pdn crashes immediately after the start!

Edit 2

Oh, forgot to install SlimDX and OH GOLLY IT WORKS

So.. now the results!

Processor: AMD Phenom X2 945 at 3.0 Ghz, not overclocked

GPU: Old Nvidia GeForce 8600 GT

The image I used is 3664*2748 px.

Paint.NET's Motion Blur pushes the processor usage in this little widget I have up to 100%, and it lasts roughly 5-40 seconds, depending on the distance set.

The GPU motion blur is incredibly faster.

Distance 20: 1785 ms

Distance 40: 2143 ms

Distance 150 (!): 4375 ms

That's just amazing. Great work there. Will this get implemented into v4.0 please?

Edited by Frontcannon

Will this get implemented into v4.0 please?

Probably not for v4.0, but it is something I'm eyeing with hope and drool. The biggest hurdles right now are the sheer size of the interop code required for working with Direct3D 11, and the performance of software fallback.

The interop library can be partly code generated at least, but it's still enormous. It simply won't fit into the schedule for v4.0. For reference, the interop code in Paint.NET v4.0 for working with Direct2D, DirectWrite, and a subset of WIC, is currently 30,000 lines of code. All of Paint.NET v3.5 is about 200,000 lines of code.

As for software fallback, I don't have the resources to maintain two implementations for each effect (software/C# and hardware/HLSL). I have not been able to benchmark the DirectCompute reference software driver to see how it performs vs. a software/C# implementation of an effect. If it could run the above plugin at, say, 20% slower than the C# version then that'd probably be fine. But what if it were 1/100th the speed? Ouch. At least this plugin here gives me the opportunity to do that testing!


Thanks for all the feedback. I have started with Pyrochild's suggestion, as I had missed all those conversion methods on the ColorBgra struct, and the image copy is important to the overall technique. Based on this I found that the image is already in memory in a form that I can pass directly to the shader without any conversions. So the C# packing code is gone and I am copying a whole row at a time to the buffer. The HLSL has been updated to match the format in which ColorBgra packs its data.

Couldn’t do exactly the same reading the data back, unless someone knows how to convert a ColorBgra* to a ColorBgra[].

Improvements all round with the main one being the 4800 x 6400 image with an additional 1,500% performance boost.

You can download it again to get the updated effect dll and source.

Simon: I fixed one divide by zero error that was producing something like that. See if the latest version fixes it for you.

In terms of deploying support for GPU effects, SlimDX supports a custom build scenario where you can strip out what you don’t want and deploy your own assembly with your app. Of course that means adding a new dependency, which is not to be taken lightly when you are deploying desktop apps.

The reference driver performance is woeful at best, though it is not really designed to be used in production. What we really need is for Microsoft (or whoever) to release something like WARP for DirectCompute. A 20% drop compared to the CPU version would be fine for me, as it really is just a fallback. Although you know you are at a tipping point when even IE9 will be GPU accelerated.

Add the following to the appSettings section of the config file in the new version if you really want to test the reference driver.

<add key="UseReference" value="1" />


Couldn’t do exactly the same reading the data back, unless someone knows how to convert a ColorBgra* to a ColorBgra[].

Where are you seeing ColorBgra[] ?

Just use methods like GetRowAddress() and blt the data directly, just like you did on the other direction.


Where are you seeing ColorBgra[] ?

That is what the result buffer gives me to work with. There are two versions of reading a range "T[] ReadRange<T>(int count)" and "int ReadRange<T>(T[] buffer, int offset, int count)". I was seeing if I could call the second overload to get the buffer to update the image row directly as it potentially had the least overhead but I can't see a way to achieve this.

Given that ColorBgra[] is my starting point, what is the most efficient way to update the destination image, given that only part of a row may need updating due to the selection rectangle? Is it still your previous suggestion?


Forgot to mention another lesson learned: the fxc.exe compiler can only parse HLSL files saved as ASCII. If you pass it a Unicode shader file, which is what Visual Studio creates by default, the compilation errors out with the informative message:

"error X3501: 'CSMain': entrypoint not found"

where 'CSMain' is the name of your main function.
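For reference, a compile.cmd line for a compute shader might look something like this. The file names are just examples, not the plugin's actual ones; /T selects the target profile, /E names the entry point, and /Fo names the output object file:

```shell
:: Compile MotionBlur.hlsl (saved as ASCII!) to a compiled shader object.
:: cs_4_0 targets DirectCompute on D3D10-class hardware; use cs_5_0 for D3D11 cards.
fxc.exe /T cs_4_0 /E CSMain /Fo MotionBlur.fxo MotionBlur.hlsl
```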


Given that ColorBgra[] is my starting point what is the most efficient way to update the destination image?

Probably something like this.

ColorBgra[] srcPixels = /* however you're getting this right now */;
ColorBgra* pDstPixels = surface.GetPointAddress(roi.Left, roi.Top);
fixed (ColorBgra* pSrcPixels = srcPixels)
{
    IntPtr pDstPixels2 = (IntPtr)pDstPixels;
    IntPtr pSrcPixels2 = (IntPtr)pSrcPixels;

    // ... now use either System.Runtime.InteropServices.Marshal.Copy() (which probably just calls into memcpy) ...
    // ... or PaintDotNet.SystemLayer.Memory.Copy() (which definitely calls into memcpy, which is optimized for SSE2 etc.) ...
}

Don't use anything in the Memory class other than Copy and SetToZero, however.

So, you can kick off all the rendering in OnSetRenderInfo(). Then, in OnRender() you wait for that specific region to finish rendering, then blit it to the buffer that DstArgs references.


New version available. Includes fix for the width overrun that Simon found and the performance improvements for the image update.

With the source image copy and destination image update code modified the 4800 x 6400 test image is processed in 5617ms which is now a total of 17,838% faster.

Onwards to investigate Rick's suggestion of batching before the OnRender call.


This does indeed work well on Radeon graphics cards. Comparing the processor and the Radeon on a 5000x3750 image, the CPU took 4 minutes 58 seconds, while the GPU took 1759 milliseconds. That's a substantial improvement.


Phenom II 955 @ 3.2 Ghz (stock)

Radeon 4850, also stock


2 weeks later...

OK, a bit of a tangent on the whole batching thing. DirectX has some multithreading built in through the use of device contexts. There is an immediate context and there are deferred contexts. Deferred contexts are designed for recording actions and resource creation (during a game’s cut scene, for example) that can later be played back on the immediate context. The immediate context is the only one that actually executes the work, and only one can be executing at a time. So that method is out for this situation.

CUDA has the ability to read result data back while the next dispatch is executing, but I couldn’t find any reference to that functionality under DirectCompute. Even so, after checking the effect under the Visual Studio profiler, nearly all of the time is spent on the GPU and the read-back is a small portion of that, so there wouldn’t be much additional gain.

So instead I have worked away on some additional effects besides the motion blur. In the belief that you haven’t really created something generic until you have used it at least three times, here are the other two:

Gaussian Blur: Purely because a lot of other blurs are based on this one. The PDN standard effect is already pretty fast, but the GPU certainly pulls away on larger images/radii. This one presented a few problems, as I had to simulate a multi-pass (dual pass) effect using the standard single-pass effect class. I have moved all of the workaround code into a base class so the individual effects stay clean, but it is still not ideal, as only the last pass shows a progress bar.

Channel Blur: A separate and unique Gaussian blur for each color channel (bgra) with the ability to control the radius for each one. Uses the same dual pass base class as the Gaussian.

Both have the ability to blur horizontally and vertically, horizontally only, or vertically only. There is also an option to control edge behaviour.
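The dual-pass approach works because a Gaussian blur is separable: a horizontal 1-D pass followed by a vertical 1-D pass with the same weights is equivalent to the full 2-D blur. A sketch of computing the normalized 1-D kernel weights each pass would sample with (plain C# for illustration; the sigma-from-radius rule is a common convention, not necessarily what this plugin uses):

```csharp
using System;

static class GaussianKernel
{
    // Build the normalized 1-D weights for a given radius, deriving sigma
    // from the radius (radius / 3 is a common rule of thumb).
    public static double[] Build(int radius)
    {
        double sigma = Math.Max(radius / 3.0, 0.1);
        var weights = new double[2 * radius + 1];
        double sum = 0;
        for (int i = -radius; i <= radius; i++)
        {
            double w = Math.Exp(-(i * i) / (2 * sigma * sigma));
            weights[i + radius] = w;
            sum += w;
        }
        // Normalize so the weights sum to 1 and the blur preserves brightness.
        for (int i = 0; i < weights.Length; i++)
            weights[i] /= sum;
        return weights;
    }
}
```

The same table drives both passes, which is why the base class only needs to swap the sampling direction (and, for Channel Blur, a per-channel radius) between passes.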

In terms of the HLSL, I am now using compile macros. It makes the HLSL less clean, but it means I can compile a couple of different optimal shaders from the same source and just load the best one at runtime. Very useful for the edge behaviour option, which introduces different execution paths; when your inner loop executes 9 billion* times, every operation counts.

Using DirectCompute might create different requirements than most, but I would like to put forward the following additions (or their equivalent) that I had to implement, for consideration in the V4 effect remix (too late?). Some have already been mentioned elsewhere.


  • OnBeginRender(): Most of the code that would go here currently has to go in the OnSetRenderInfo method, which looks out of place once you have a bit of it.

  • OnBeginPass(): Takes a pass number and the source and destination args. I use this one for loading the correct shader for the particular pass (horizontal or vertical), and also to copy the source image to the buffer, as it changes between passes.

  • OnPassCompleted() and OnRenderCompleted(): Didn’t implement these, but they would be handy for resource clean-up and profiling.

  • MaxRegionWidth and MaxRegionHeight: To control region slice sizes for working within resource limits.

  • Passes: The number of passes a multi-pass effect has. Would need to be changeable up until the render starts; for example, if only the vertical blur option were selected, only 1 pass would be needed.

Effect and source have been updated.


*Motion blur at 25 degrees and 200 distance creates 301 sample points per pixel: 301 x 4800 x 6400 ≈ 9.2 billion.

