
GPU Motion Blur effect using DirectCompute


50 replies to this topic

#1 Bruce Bowyer-Smyth

Bruce Bowyer-Smyth
  • Members
  • 141 posts
  • LocationAustralia
  • Reputation:3

Posted 22 May 2010 - 05:01 AM

Edit: This effect has now been published here. If you just want to use it, get it from there.

GPU based effects and comparisons to CPU

This started out as a bit of a personal research project but I wanted to share the code and get some opinions.

There has been a lot of talk about using the GPU for general purpose computation (GPGPU), so I wanted to see if Paint.net effects could benefit from this technique. Using DirectCompute and compute shaders, could they outperform a CPU by enough of a margin to make dealing with the extra dependencies worthwhile? Short answer is yes, yes they can.

I ran a few tests with the standard Motion Blur effect on my middle-aged computer to confirm. The GPU version produces the same image (with slight color variations due to rounding differences).

Intel Core 2 Duo E6400
VS
NVIDIA 8800 GTS 320MB


Test image: 960 x 1280 photo 72dpi
Effect Settings: Motion Blur, Direction = 25.00, Centered = ticked
Blur Distance   CPU (approx.)   GPU       Speed Increase (approx.)
10              2400 ms         311 ms    671%
50              10600 ms        348 ms    2,945%
100             20100 ms        398 ms    4,950%
200             38000 ms        498 ms    7,530%

It is interesting that, even with the overhead of copying the entire image to the video card, even the smallest computation is notably faster.
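
For anyone checking the numbers: the speed-increase column is consistent with (CPU time / GPU time - 1) * 100, truncated to a whole percent. Below is a minimal C# sketch of that arithmetic. The class and method names are made up for illustration and are not part of the plugin.

```csharp
using System;

static class SpeedIncrease
{
    // Percentage speed increase of the GPU run over the CPU run.
    public static double Percent(double cpuMs, double gpuMs)
    {
        return (cpuMs / gpuMs - 1.0) * 100.0;
    }

    static void Main()
    {
        // CPU and GPU times in milliseconds, taken from the 960 x 1280 table.
        double[][] runs =
        {
            new double[] { 2400, 311 },
            new double[] { 10600, 348 },
            new double[] { 20100, 398 },
            new double[] { 38000, 498 },
        };

        foreach (double[] run in runs)
        {
            // Truncate rather than round, which matches the figures in the table.
            double percent = Math.Floor(Percent(run[0], run[1]));
            Console.WriteLine("{0} ms vs {1} ms: {2}%", run[0], run[1], percent);
        }
    }
}
```

Note that a 671% increase means the GPU run is roughly 7.7 times as fast, not 6.7 times.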

Let’s up the image size a bit.

Test image: Resized 400%, 3840 x 5120
Blur Distance   CPU (approx.)   GPU       Speed Increase (approx.)
200             10 min 42.7 s   5567 ms   11,444%

More?

Test image: Resized 500%, 4800 x 6400
Blur Distance   CPU (approx.)   GPU       Speed Increase (approx.)
200             16 min 47.6 s   8281 ms   12,067%

Well that’s pretty impressive; it seems the GPU loves large data sets. Obviously this is a relative comparison, and if I had a quad core the difference would be about half, but even that is pretty good. As I am a gamer I aimed for the dual core and the 8800 to be fairly balanced when I bought them so that one wouldn’t be a bottleneck for the other. So I think this is a fair comparison.

Anyways enough talk. Time for you to try. For now this is a manual install. Here is what you need:

Prerequisites
  • Windows 7 or Windows Vista with the DirectX 11 platform update (x86, x64).
  • SlimDX Runtime (February 2010)
  • Latest video drivers. DirectCompute support hasn’t been around for long so you will need to update your video drivers to get it. Download GPU-Z and confirm that the DirectCompute checkbox is checked. If it isn’t, you either have an unsupported video card or don’t have the latest drivers. If an unsupported device is found, the GPU effect will fall back to the reference (software) driver, which is incredibly slow.

I haven’t got an AMD/ATI card to try it out so I would be interested to hear if all is well on those cards plus how a newer NVIDIA card performs.

Extract this zip file into the Paint.net effects folder.

ComputeShaderEffects.zip

If you want to see the render time, drop this config file into the effects folder along with the other dll. It will show a message box when the render is complete when processing a full image selection. Remove the config file when you are done.

ComputeShaderEffectsConfig.zip

Known issues: Currently getting an Out of Memory exception well before using up the available video card memory. Haven’t investigated this one yet. I know I need enough memory for the image and the output buffer, but it fails well short of that.

Edited by Bruce Bowyer-Smyth, 10 July 2010 - 10:19 PM.


#2 Bruce Bowyer-Smyth


Posted 22 May 2010 - 05:02 AM

Dev Notes

Due to the newness of DirectCompute there are very few resources on the web. Most of these are for Darth C++ as you would expect, with just a couple showing how to use it in .net. So hopefully this example may help people out although I have no prior experience with DirectX so it may not follow “best practices” yet.

Development Prerequisites

Lessons Learned
  • Having an external rendering framework doesn’t fit exactly into the existing Paint.net effect model, so there are a couple of things that need to be done if you are not writing a CPU effect. The first is to pass EffectFlags.SingleThreaded to the base constructor, which essentially says “I want to manage threading myself (on the GPU) and not be CPU threaded”. The second is to set up the framework, and anything else needed across render calls, in the OnSetRenderInfo method (is there a better way to do this?).
  • HLSL (High Level Shading Language) constant buffers must be sized in multiples of 16 bytes. You can either pack variables or add padding. See the Constants struct. Constants can be used to pass your configurable effect parameters into the shader.
  • Timeout Detection and Recovery (TDR) is a Windows Vista/7 feature that stops a frozen display driver from locking up your system. If a display driver doesn’t respond within 2 seconds it is restarted with a message like “Display driver has stopped responding and has successfully recovered”. I initially thought the way PDN slices images into multiple render calls would be the Achilles heel of this solution, but it actually turned out to be its saviour, keeping each batch well below 2 seconds. Don’t be surprised if you hit this problem before you get to make improvements to your code, though. TDR can be disabled through the registry but it is not advisable to do so.
  • SlimDX is a thin wrapper over DirectX11 and many other Windows technologies. Just about every object it creates in this solution is an unmanaged one so they all need to be tracked and disposed of in a timely manner.
  • .net types map pretty well into hlsl but it has a limited type set. Mainly floats and ints are used. There is no byte so I was originally converting ColorBgra (struct of 4 bytes) into a float4 (struct of 4 floats) but the memory use was too large. I am now packing the 4 bytes into 1 int. Didn’t measure the speed beforehand but it actually seems a little quicker as there is a lot less information to copy and retrieve even with the pack/unpack overhead.
  • Compute shader resources are all about buffers and views. Normally you create a buffer with data you want to pass and then create a view of that buffer.
  • Debugging is difficult. As with most new tech the tools to build come first and then the tools to debug are refined later. Both AMD and NVIDIA are producing their own tools for this purpose. I have signed up to the NVIDIA Parallel Nsight beta which is an addin to Visual Studio. Just downloaded it so haven’t had a chance to use it yet but it should be a lot better than what I was doing before which was to set pixels to certain colors based on a condition I wanted to check.
  • Hlsl is compiled with fxc.exe that comes with the DX SDK. See the compile.cmd file for the syntax. You can also compile at runtime from the hlsl file.
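
To illustrate the 16-byte rule from the list above, here is a minimal C# sketch of a padded parameter block. The BlurConstants struct and its field names are hypothetical; the actual Constants struct in the source zip may look different.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical parameter block; the real Constants struct may differ.
[StructLayout(LayoutKind.Sequential)]
struct BlurConstants
{
    public int Width;        // 4 bytes
    public int Height;       // 4 bytes
    public float Distance;   // 4 bytes
    public float Angle;      // 4 bytes
    public int Centered;     // 4 bytes, 20 bytes of real data so far
    public int Padding0;     // three unused ints (12 bytes) bring the total to 32
    public int Padding1;
    public int Padding2;
}

static class ConstantBufferSize
{
    // Round a byte count up to the next multiple of 16.
    public static int RoundUp16(int bytes)
    {
        return (bytes + 15) & ~15;
    }

    static void Main()
    {
        int size = Marshal.SizeOf(typeof(BlurConstants));
        Console.WriteLine("struct size: {0}, multiple of 16: {1}", size, size % 16 == 0);
        Console.WriteLine("20 bytes of fields need a buffer of {0} bytes", RoundUp16(20));
    }
}
```

Checking Marshal.SizeOf against a multiple of 16 before creating the buffer is a cheap way to catch a forgotten padding field.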

ComputeShaderEffectsSource.zip

Feel free to use this code to create your own effect if you want. Just remember to set the build action of your fx file to “Embedded Resource”.

Interested to hear of any improvements or suggestions from those in the know.

#3 pyrochild

pyrochild
  • Administrators
  • 11,496 posts
  • LocationColorado
  • Reputation:205

Posted 22 May 2010 - 05:15 AM

There is no byte so I was originally converting ColorBgra (struct of 4 bytes) into a float4 (struct of 4 floats) but the memory use was too large. I am now packing the 4 bytes into 1 int. Didn’t measure the speed beforehand but it actually seems a little quicker as there is a lot less information to copy and retrieve even with the pack/unpack overhead.

You shouldn't need to pack it yourself. It's already available as an Int32 union via ColorBgra.Bgra. Or maybe it's UInt32. Don't remember.
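
The kind of union described here can be pictured with a stand-in struct. This is only a sketch, not the real ColorBgra: the PackedColor name is invented, and using uint for the combined field is an assumption, since the post itself is unsure whether the field is Int32 or UInt32.

```csharp
using System;
using System.Runtime.InteropServices;

// Stand-in for illustration only; not the real PaintDotNet.ColorBgra.
[StructLayout(LayoutKind.Explicit)]
struct PackedColor
{
    [FieldOffset(0)] public byte B;
    [FieldOffset(1)] public byte G;
    [FieldOffset(2)] public byte R;
    [FieldOffset(3)] public byte A;

    // Overlaps the four bytes above, so no explicit packing step is needed.
    [FieldOffset(0)] public uint Bgra;
}

static class PackDemo
{
    // Manual packing; gives the same value as reading Bgra on a little-endian machine.
    public static uint Pack(byte b, byte g, byte r, byte a)
    {
        return (uint)b | ((uint)g << 8) | ((uint)r << 16) | ((uint)a << 24);
    }

    static void Main()
    {
        PackedColor c = new PackedColor();
        c.B = 0x10;
        c.G = 0x20;
        c.R = 0x30;
        c.A = 0xFF;

        Console.WriteLine("union:  0x{0:X8}", c.Bgra);
        Console.WriteLine("manual: 0x{0:X8}", Pack(c.B, c.G, c.R, c.A));
    }
}
```

Since the overlapping field already holds the packed value, the per-pixel pack step on the C# side becomes unnecessary.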

#4 Rick Brewster

Rick Brewster

    Paint.NET Author and Developer

  • Administrators
  • 13,579 posts
  • LocationKirkland, WA
  • Reputation:328

Posted 22 May 2010 - 05:43 AM

Sweet. On my system* it runs at the same performance no matter what setting I choose for "distance", during which PaintDotNet.exe only shows a few % of CPU usage.

As for an optimization hint, you can treat the OnRender() call as "this region must be finished rendering by the time you return, after which you can't write to it anymore." Contrast this to, "you can only render to this region when I hand it to you." In other words, you are allowed to render to a region at any time between OnSetRenderInfo() and the completion of your OnRender() implementation that is told to render that region.

I believe Ed is using a trick in his Fast Blur such that he queues up and begins all rendering in OnSetRenderInfo() and then each OnRender() call simply waits for that region to be finished before it returns. (He is doing his own background/worker thread management.)

* Core i7 980x 4.0GHz with 12GB RAM, GeForce GTX 260 Core 216 with 896MB RAM
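
The "queue everything up front, then wait per region" idea can be sketched in isolation. The class below is a hypothetical stand-in and does not derive from the Paint.NET Effect class; SetRenderInfo and Render only mirror the roles of OnSetRenderInfo() and OnRender(), and the squared number stands in for real pixel work.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in; not the Paint.NET Effect base class.
class QueuedRenderer
{
    private readonly int[] output;
    private Task[] regionTasks;

    public QueuedRenderer(int regionCount)
    {
        output = new int[regionCount];
    }

    // Plays the role of OnSetRenderInfo(): start all the work up front.
    public void SetRenderInfo()
    {
        regionTasks = new Task[output.Length];
        for (int i = 0; i < output.Length; i++)
        {
            int region = i; // copy the loop variable for the closure
            regionTasks[i] = Task.Run(() => { output[region] = region * region; });
        }
    }

    // Plays the role of OnRender(): block until this region is done, then hand it back.
    public int Render(int region)
    {
        regionTasks[region].Wait();
        return output[region];
    }
}

static class QueuedRendererDemo
{
    static void Main()
    {
        QueuedRenderer renderer = new QueuedRenderer(4);
        renderer.SetRenderInfo();
        for (int region = 0; region < 4; region++)
        {
            Console.WriteLine("region {0} -> {1}", region, renderer.Render(region));
        }
    }
}
```

In a GPU effect the tasks would be replaced by whatever signals that a region's results have been read back.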
The Paint.NET Blog: http://blog.getpaint.net/
Donations are always appreciated! http://www.getpaint.net/donate.html


#5 Frontcannon

Frontcannon
  • Members
  • 2,302 posts
  • LocationNorth-Rhine Westphalia, Germany
  • Reputation:4

Posted 22 May 2010 - 01:03 PM

I'll try this out and post the results!

Edit 1
Hey, pdn crashes immediately after the start!

Edit 2
Oh, forgot to install SlimDX and OH GOLLY IT WORKS

So.. now the results!

Processor: AMD Phenom X2 945 at 3.0 GHz, not overclocked
GPU: Old Nvidia GeForce 8600 GT

The image I used is 3664*2748 px.

Paint.NET's Motion Blur makes the processor usage in this little widget you have go up to 100%, and it lasts for roughly 5-40 seconds, depending on the distance set.

The GPU motion blur is incredibly fast in comparison.
Distance 20: 1785 ms
Distance 40: 2143 ms
Distance 150 (!): 4375 ms

That's just amazing. Great work there. Will this get implemented into v4.0 please?

Edited by Frontcannon, 22 May 2010 - 01:43 PM.


#6 csm725

csm725
  • Competition Hosts
  • 2,176 posts
  • Locationcsm725.com
  • Reputation:7

Posted 22 May 2010 - 02:20 PM

Why isn't my DirectCompute box checked? I just got a new laptop ~1 month ago...

Posted Image



#7 Rick Brewster


Posted 22 May 2010 - 07:27 PM

Because Intel doesn't have DirectCompute support in their video drivers.

#8 csm725


Posted 22 May 2010 - 08:54 PM

 Oh man! I knew I should have gotten the ATI Radeon... 

Edited by csm725, 22 May 2010 - 08:54 PM.


#9 Simon Brown

Simon Brown
  • Members
  • 10,255 posts
  • Reputation:27

Posted 22 May 2010 - 09:27 PM

I use a GeForce 9600M GS and it worked for me after installing a newer driver from NVidia rather than the OEM one. When I run the plugin I get a strange artifact at the edge, is it just me?

Posted Image

#10 Sozo

Sozo
  • Competition Hosts
  • 4,430 posts
  • LocationYe Olde Dominion
  • Reputation:18

Posted 22 May 2010 - 11:04 PM

Would I have to do anything special to try this on my Radeon?

#11 Rick Brewster


Posted 22 May 2010 - 11:11 PM

Dunno. You tell us!

#12 Rick Brewster


Posted 22 May 2010 - 11:18 PM

Will this get implemented into v4.0 please?

Probably not for v4.0, but it is something I'm eyeing with hope and drool. The biggest hurdles right now are the sheer size of the interop code required for working with Direct3D 11, and the performance of software fallback.

The interop library can be partly code generated at least, but it's still enormous. It simply won't fit into the schedule for v4.0. For reference, the interop code in Paint.NET v4.0 for working with Direct2D, DirectWrite, and a subset of WIC, is currently 30,000 lines of code. All of Paint.NET v3.5 is about 200,000 lines of code.

As for software fallback, I don't have the resources to maintain two implementations for each effect (software/C# and hardware/HLSL). I have not been able to benchmark the DirectCompute reference software driver to see how it performs vs. a software/C# implementation of an effect. If it could run the above plugin at, say, 20% slower than the C# version then that'd probably be fine. But what if it were 1/100th the speed? Ouch. At least this plugin here gives me the opportunity to do that testing!

#13 Bruce Bowyer-Smyth


Posted 23 May 2010 - 05:23 AM

Thanks for all the feedback. I have started with Pyrochild's suggestion, as I had missed all those conversion methods on the ColorBgra struct and the image copy is important to the overall technique. Based on this I found that the image is already in memory in a form that I can pass directly to the shader without any conversions. So the C# packing code is gone and I am copying a whole row at a time to the buffer. The hlsl has been updated to match the format ColorBgra packs its data in.

Couldn’t do exactly the same reading the data back unless someone knows how to convert a ColorBgra* to a ColorBgra[].

Improvements all round with the main one being the 4800 x 6400 image with an additional 1,500% performance boost.

You can download it again to get the updated effect dll and source.

Simon: I fixed one divide by zero error that was producing something like that. See if the latest version fixes it for you.


In terms of deploying support for GPU effects, SlimDX supports a custom build scenario where you can strip out what you don’t want and deploy your own assembly with your app. Of course that means adding a new dependency, which is not to be taken lightly when you are deploying desktop apps.

The reference driver performance is woeful at best, though it is not really designed to be used in production. What we really need is for Microsoft (or whoever) to release something like WARP for DirectCompute. A 20% drop compared to the CPU version would be fine for me as it really is just a fallback. Although you know you are at a tipping point when even IE9 will be GPU accelerated.

Add the following to the appSettings section of the config file in the new version if you really want to test the reference driver.
<add key="UseReference" value="1" />


#14 Rick Brewster


Posted 23 May 2010 - 05:33 AM

Couldn’t do exactly the same reading the data back unless someone knows how to convert a ColorBgra* to a ColorBgra[].

Where are you seeing ColorBgra[] ?

Just use methods like GetRowAddress() and blt the data directly, just like you did on the other direction.

#15 Simon Brown


Posted 23 May 2010 - 09:08 AM

Simon: I fixed one divide by zero error that was producing something like that. See if the latest version fixes it for you.


It doesn't. :(

More detailed screenshots:

Posted Image

Posted Image

#16 Bruce Bowyer-Smyth


Posted 23 May 2010 - 11:45 AM

Simon: Yes, it is happening with images whose width is not cleanly divisible by 10. This is due to how I am breaking down the image for threading. It will be fixed in the next release I put out.
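
One common way an edge artifact like this arises is truncating integer division when working out how many thread groups to dispatch. The thread does not show the actual dispatch code, so the sketch below is an assumption about the cause; the group width of 10 and all names are illustrative only.

```csharp
using System;

static class DispatchMath
{
    // Truncating division: leaves the last partial group of pixels unprocessed.
    public static int GroupCountTruncated(int pixels, int groupSize)
    {
        return pixels / groupSize;
    }

    // Ceiling division: always covers every pixel, possibly overshooting the edge.
    public static int GroupCount(int pixels, int groupSize)
    {
        return (pixels + groupSize - 1) / groupSize;
    }

    static void Main()
    {
        int width = 965;    // a width that is not a multiple of the group width
        int groupSize = 10; // assumed group width

        int truncated = GroupCountTruncated(width, groupSize);
        int rounded = GroupCount(width, groupSize);

        Console.WriteLine("truncated: {0} groups cover {1} of {2} pixels",
            truncated, truncated * groupSize, width);
        Console.WriteLine("rounded up: {0} groups cover {1} pixels",
            rounded, rounded * groupSize);
    }
}
```

With the rounded-up count, the extra threads past the right edge need a bounds check in the shader (for example, returning early when x >= width) so they do not read or write outside the image.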

#17 Simon Brown


Posted 23 May 2010 - 03:49 PM

Simon: Yes, it is happening with images whose width is not cleanly divisible by 10.


Yes, if I change the width of the images it works. Thanks.

#18 Bruce Bowyer-Smyth


Posted 23 May 2010 - 08:37 PM

Where are you seeing ColorBgra[] ?


That is what the result buffer gives me to work with. There are two versions of reading a range "T[] ReadRange<T>(int count)" and "int ReadRange<T>(T[] buffer, int offset, int count)". I was seeing if I could call the second overload to get the buffer to update the image row directly as it potentially had the least overhead but I can't see a way to achieve this.

Given that ColorBgra[] is my starting point, what is the most efficient way to update the destination image, given that only part of a row may need updating due to the selection rectangle? Is it still your previous suggestion?

#19 Bruce Bowyer-Smyth


Posted 23 May 2010 - 08:48 PM

Forgot to mention another lesson learned was that the fxc.exe compiler can only parse hlsl files saved as ASCII. If you try to pass it a Unicode shader text file, which is what Visual Studio creates by default, the compilation will error out with the informative message:

"error X3501: 'CSMain': entrypoint not found".

Where 'CSMain' is the name of your main function.
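
If the shader file has already been saved as Unicode, it can be re-saved as ASCII before handing it to fxc.exe. Here is a minimal C# sketch, assuming the file carries a byte-order mark so that its encoding can be detected on read. The class and method names are invented for illustration.

```csharp
using System;
using System.IO;
using System.Text;

static class ShaderEncoding
{
    // Re-save a shader source file as plain ASCII (no byte-order mark).
    public static void SaveAsAscii(string path)
    {
        // ReadAllText uses the byte-order mark, if present, to pick the decoder.
        string source = File.ReadAllText(path);

        // Encoding.ASCII writes no preamble; any non-ASCII character becomes '?'.
        File.WriteAllText(path, source, Encoding.ASCII);
    }

    static void Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.WriteLine("usage: pass the path of the .hlsl or .fx file to convert");
            return;
        }

        SaveAsAscii(args[0]);
        Console.WriteLine("Re-saved {0} as ASCII", args[0]);
    }
}
```

In Visual Studio the same result can be had from the encoding option in the Save As dialog.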

#20 Rick Brewster


Posted 24 May 2010 - 12:19 AM

Given that ColorBgra[] is my starting point what is the most efficient way to update the destination image?

Probably something like this.
ColorBgra[] srcPixels = ... however you're getting this right now ... ;
ColorBgra* pDstPixels = ... surface.GetPointAddress(roi.Left, roi.Top);
fixed (ColorBgra* pSrcPixels = srcPixels)
{
	IntPtr pDstPixels2 = (IntPtr)pDstPixels;
	IntPtr pSrcPixels2 = (IntPtr)pSrcPixels;

	// ... now copy from pSrcPixels2 to pDstPixels2 using either
	// System.Runtime.InteropServices.Marshal.Copy() (which probably just calls into memcpy) ...
	// ... or PaintDotNet.SystemLayer.Memory.Copy() (which definitely calls into memcpy, which is optimized for SSE2 etc.) ...
}
Don't use anything in the Memory class other than Copy and SetToZero, however.

So, you can kick off all the rendering in OnSetRenderInfo(). Then, in OnRender() you wait for that specific region to finish rendering, then blit it to the buffer that DstArgs references.