Bruce Bowyer-Smyth
Posted May 22, 2010 (edited)

Edit: This effect has now been published here. If you just want to use it, get it from there.

GPU based effects and comparisons to CPU

This started out as a bit of a personal research project, but I wanted to share the code and get some opinions. There has been a lot of talk about using the GPU for general purpose computation (GPGPU), so I wanted to see if Paint.net effects could benefit from this technique. Using DirectCompute and compute shaders, could they outperform a CPU by enough of a margin to make dealing with the extra dependencies worthwhile?

Short answer is yes, yes they can. I ran a few tests with the standard Motion Blur effect on my middle-aged computer to confirm. The GPU version produces the same image (with slight color variations due to rounding differences).

Intel Core 2 Duo E6400 vs NVIDIA 8800 GTS 320MB
Test image: 960 x 1280 photo, 72dpi
Effect settings: Motion Blur, Direction = 25.00, Centered = ticked

Blur Distance   CPU (Approx Times)   GPU       Speed Increase (Approx)
10              2400ms               311ms     671%
50              10600ms              348ms     2,945%
100             20100ms              398ms     4,950%
200             38000ms              498ms     7,530%

It is interesting that, even with the overhead of copying the entire image so the video card can use it, even the smallest computation is notably faster. Let's up the image size a bit.

Test image: Resized 400%, 3840 x 5120

Blur Distance   CPU (Approx Times)   GPU       Speed Increase (Approx)
200             10min 42.7s          5567ms    11,444%

More?

Test image: Resized 500%, 4800 x 6400

Blur Distance   CPU (Approx Times)   GPU       Speed Increase (Approx)
200             16min 47.6s          8281ms    12,067%

Well that's pretty impressive, seems the GPU loves the large data sets. Obviously this is a relative comparison, and if I had a quad core the difference would be about half, but even that is pretty good. As I am a gamer, I aimed for the dual core and the 8800 to be fairly balanced when I bought them so that one wouldn't be a bottleneck for the other. So I think this is a fair comparison.

Anyways, enough talk. Time for you to try. For now this is a manual install. Here is what you need:

Prerequisites
- Windows 7 or Windows Vista with the DirectX 11 platform update (x86, x64).
- SlimDX Runtime (February 2010)
- Latest video drivers. DirectCompute support hasn't been around for long, so you will need to update your video drivers to get it. Download GPU-Z and confirm that the DirectCompute checkbox is checked. If it isn't, you either have an unsupported video card or don't have the latest drivers. If an unsupported device is found, the GPU effect will fall back to the reference driver (software), which is incredibly slow.

I haven't got an AMD/ATI card to try it out, so I would be interested to hear if all is well on those cards, plus how a newer NVIDIA card performs.

Extract this zip file into the Paint.net effects folder.
ComputeShaderEffects.zip

If you want to see the render time, drop this config file into the effects folder along with the other dll. When processing a full image selection, it will show a message box once the render is complete. Remove the config file when you are done.
ComputeShaderEffectsConfig.zip

Known issues:
Currently getting an Out of Memory exception well before using the available video card memory. Haven't investigated this one yet. I know I need enough memory for the image and the output buffer, but it seems well short of it.

Edited July 10, 2010 by Bruce Bowyer-Smyth

GPU Blur Plugin | WMF File Plugin
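For anyone checking the tables above: the "Speed Increase" column appears to be the percentage by which the CPU time exceeds the GPU time, truncated to a whole percent. That reading is inferred from the posted numbers, not stated in the post, and the helper name below is made up for illustration.

```csharp
using System;

static class SpeedIncrease
{
    // Assumed formula: (cpu - gpu) / gpu * 100, truncated to a whole percent.
    public static long Percent(double cpuMs, double gpuMs)
    {
        return (long)Math.Floor((cpuMs - gpuMs) / gpuMs * 100.0);
    }

    static void Main()
    {
        Console.WriteLine(Percent(2400, 311));      // 671
        Console.WriteLine(Percent(38000, 498));     // 7530
        // 16min 47.6s = 1,007,600 ms
        Console.WriteLine(Percent(1007600, 8281));  // 12067
    }
}
```

Note that a 671% increase means the GPU run is about 7.7 times as fast, not 6.7 times.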
Bruce Bowyer-Smyth (Author)
Posted May 22, 2010

Dev Notes

Due to the newness of DirectCompute there are very few resources on the web. Most of these are for Darth C++ as you would expect, with just a couple showing how to use it in .net. So hopefully this example may help people out, although I have no prior experience with DirectX so it may not follow "best practices" yet.

Development Prerequisites
- Windows 7 or Windows Vista with the DirectX 11 platform update (x86, x64).
- SlimDX SDK (February 2010)
- DirectX 11 SDK (February 2010)
- Visual Studio 2010 (although you can probably hack the project files to run it on earlier versions)
- Paint.net (of course)

Lessons Learned

Having an external rendering framework doesn't fit exactly into the existing Paint.net effect model, so there are a couple of things that need to be done if you are not writing a CPU effect. The first is to set EffectFlags.SingleThreaded in the constructor, which essentially says I want to manage threading myself (on the GPU) and not be CPU threaded. The second is to set up our framework and anything needed across render calls in the OnSetRenderInfo method (is there a better way to do this?).

HLSL (High Level Shader Language) constants must be multiples of 16 bytes. You can either pack variables or add padding. See the Constants struct. Constants can be used to pass your configurable effect parameters into the shader.

Timeout Detection and Recovery (TDR) is a Windows Vista/7 feature that stops display drivers from locking up your system if they freeze. If a display driver doesn't respond in 2 seconds it will be restarted with a message like "Display driver has stopped responding and has successfully recovered". I initially thought the way PDN slices images for multiple render calls would be the Achilles heel of this solution, but it actually turned out to be its saviour, keeping each batch well below 2 seconds. Though don't be surprised if you hit this problem before you get to make improvements to your code. TDR can be disabled through the registry, but it is not advisable to do so.

SlimDX is a thin wrapper over DirectX 11 and many other Windows technologies. Just about every object it creates in this solution is an unmanaged one, so they all need to be tracked and disposed of in a timely manner.

.net types map pretty well into HLSL, but it has a limited type set. Mainly floats and ints are used. There is no byte, so I was originally converting ColorBgra (struct of 4 bytes) into a float4 (struct of 4 floats), but the memory use was too large. I am now packing the 4 bytes into 1 int. Didn't measure the speed beforehand, but it actually seems a little quicker as there is a lot less information to copy and retrieve, even with the pack/unpack overhead.

Compute shader resources are all about buffers and views. Normally you create a buffer with the data you want to pass and then create a view of that buffer.

Debugging is difficult. As with most new tech, the tools to build come first and the tools to debug are refined later. Both AMD and NVIDIA are producing their own tools for this purpose. I have signed up to the NVIDIA Parallel Nsight beta, which is an add-in to Visual Studio. Just downloaded it so haven't had a chance to use it yet, but it should be a lot better than what I was doing before, which was to set pixels to certain colors based on a condition I wanted to check.

HLSL is compiled with fxc.exe, which comes with the DX SDK. See the compile.cmd file for the syntax. You can also compile at runtime from the hlsl file.

ComputeShaderEffectsSource.zip

Feel free to use this code to create your own effect if you want. Just remember to set the build action of your fx file to "Embedded Resource".

Interested to hear of any improvements or suggestions from those in the know.
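As a rough illustration of the 16-byte rule for constants mentioned in the dev notes above, here is a minimal C# sketch. The struct and its field names are hypothetical, not the Constants struct from the posted source; it only shows padding a three-field block out to 16 bytes and checking the size before it would be handed to a constant buffer.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical constants block; the field names are illustrative only.
[StructLayout(LayoutKind.Sequential)]
struct BlurConstants
{
    public int Width;     // 4 bytes
    public int Height;    // 4 bytes
    public float Angle;   // 4 bytes
    public int Padding;   // 4 bytes, unused; rounds the size up to 16
}

static class ConstantsCheck
{
    // True when the marshalled size of T is a multiple of 16 bytes.
    public static bool IsValidSize<T>() where T : struct
    {
        return Marshal.SizeOf(typeof(T)) % 16 == 0;
    }

    static void Main()
    {
        Console.WriteLine(Marshal.SizeOf(typeof(BlurConstants))); // 16
        Console.WriteLine(IsValidSize<BlurConstants>());          // True
    }
}
```

A check like this at startup is cheap and catches the case where someone adds a field later and forgets to adjust the padding.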
pyrochild
Posted May 22, 2010

Quote:
    There is no byte so I was originally converting ColorBgra (struct of 4 bytes) into a float4 (struct of 4 floats) but the memory use was too large. I am now packing the 4 bytes into 1 int. Didn't measure the speed beforehand but it actually seems a little quicker as there is a lot less information to copy and retrieve even with the pack/unpack overhead.

You shouldn't need to pack it yourself. It's already available as an Int32 union via ColorBgra.Bgra. Or maybe it's UInt32. Don't remember.

ambigram signature by Kemaru [i write plugins and stuff] If you like a post, upvote it!
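To make the "union" point concrete, here is a small sketch of how four byte channels and one 32-bit value can share the same memory. PixelBgra is a stand-in written for this example, not the real PaintDotNet.ColorBgra, and the Pack and Channel helpers are hypothetical; the shader side would presumably do the matching shifts and masks.

```csharp
using System;
using System.Runtime.InteropServices;

// Stand-in for a BGRA pixel (NOT PaintDotNet.ColorBgra). The four byte
// fields and the 32-bit field overlap in memory.
[StructLayout(LayoutKind.Explicit)]
struct PixelBgra
{
    [FieldOffset(0)] public byte B;
    [FieldOffset(1)] public byte G;
    [FieldOffset(2)] public byte R;
    [FieldOffset(3)] public byte A;
    [FieldOffset(0)] public uint Bgra;
}

static class PixelPacking
{
    // Manual packing; matches the overlapped field on little-endian machines.
    public static uint Pack(byte b, byte g, byte r, byte a)
    {
        return (uint)b | ((uint)g << 8) | ((uint)r << 16) | ((uint)a << 24);
    }

    // Extract one channel as a 0..1 float; shift is 0, 8, 16 or 24.
    public static float Channel(uint packed, int shift)
    {
        return ((packed >> shift) & 0xFF) / 255f;
    }

    static void Main()
    {
        var p = new PixelBgra { B = 0x10, G = 0x20, R = 0x30, A = 0xFF };
        Console.WriteLine(p.Bgra == Pack(p.B, p.G, p.R, p.A)); // True on little-endian
        Console.WriteLine(Channel(p.Bgra, 24));                // 1
    }
}
```

With the overlapped field there is no per-pixel packing work on the CPU side at all, which is the saving pyrochild is pointing at.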
Rick Brewster
Posted May 22, 2010

Sweet. On my system* it runs at the same performance no matter what setting I choose for "distance", during which PaintDotNet.exe only shows a few % of CPU usage.

As for an optimization hint, you can treat the OnRender() call as "this region must be finished rendering by the time you return, after which you can't write to it anymore." Contrast this to, "you can only render to this region when I hand it to you." In other words, you are allowed to render to a region at any time between OnSetRenderInfo() and the completion of your OnRender() implementation that is told to render that region.

I believe Ed is using a trick in his Fast Blur such that he queues up and begins all rendering in OnSetRenderInfo() and then each OnRender() call simply waits for that region to be finished before it returns. (He is doing his own background/worker thread management.)

* Core i7 980x 4.0GHz with 12GB RAM, GeForce GTX 260 Core 216 with 896MB RAM

The Paint.NET Blog: https://blog.getpaint.net/
Donations are always appreciated! https://www.getpaint.net/donate.html
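The "start everything in OnSetRenderInfo(), wait per region in OnRender()" idea described above could look roughly like the sketch below. RegionScheduler is a hypothetical helper, not part of Paint.NET and not Ed's actual code; regions are identified by a plain index here, where a real effect would key on the rectangles it is handed.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical helper, not part of Paint.NET.
sealed class RegionScheduler
{
    private readonly Dictionary<int, Task> pending = new Dictionary<int, Task>();

    // Would be called once from OnSetRenderInfo(): start work on every region.
    public void BeginAll(int regionCount, Action<int> renderRegion)
    {
        pending.Clear();
        for (int i = 0; i < regionCount; i++)
        {
            int region = i; // copy for the closure
            pending[region] = Task.Factory.StartNew(() => renderRegion(region));
        }
    }

    // Would be called from OnRender(): block until that region is finished.
    public void WaitFor(int region)
    {
        pending[region].Wait();
    }
}

static class SchedulerDemo
{
    static void Main()
    {
        var done = new bool[4];
        var scheduler = new RegionScheduler();
        scheduler.BeginAll(done.Length, r => done[r] = true);
        for (int r = 0; r < done.Length; r++)
        {
            scheduler.WaitFor(r);
            Console.WriteLine(done[r]); // True
        }
    }
}
```

For a GPU effect the queued work would be dispatches rather than CPU tasks, so how much this buys depends on whether the device can overlap anything.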
Frontcannon
Posted May 22, 2010 (edited)

I'll try this out and post the results!

Edit 1: Hey, pdn crashes immediately after the start!

Edit 2: Oh, forgot to install SlimDX and OH GOLLY IT WORKS

So.. now the results!

Processor: AMD Phenom X2 945 at 3.0 GHz, not overclocked
GPU: Old Nvidia GeForce 8600 GT

The image I used is 3664*2748 px. Paint.NET's Motion Blur lets the processor usage in this little widget you have go up to 100%, and it lasts for roughly 5-40 seconds, depending on the distance set.

The GPU motion blur is incredibly faster.

Distance 20: 1785 ms
Distance 40: 2143 ms
Distance 150 (!): 4375 ms

That's just amazing. Great work there. Will this get implemented into v4.0 please?

Edited May 22, 2010 by Frontcannon

Night Vision Text Effect Tutorial Gallery reddit.com/r/futurebeats | My Mixcloud
csm725
Posted May 22, 2010

Why isn't my DirectCompute box checked? I just got a new laptop ~1 month ago...

My deviantART | Sig Battles | My Tutorials | csm725.com Click to enter or vote in the official Paint.NET competitions! COMPETITIONS: LOGO OF THE WEEK
Rick Brewster
Posted May 22, 2010

Because Intel doesn't have DirectCompute support in their video drivers.
csm725
Posted May 22, 2010 (edited)

Oh man! I knew I should have gotten the ATI Radeon...

Edited May 22, 2010 by csm725
Simon Brown
Posted May 22, 2010

I use a GeForce 9600M GS and it worked for me after installing a newer driver from NVIDIA rather than the OEM one.

When I run the plugin I get a strange artifact at the edge. Is it just me?
Sozo
Posted May 22, 2010

Would I have to do anything special to try this on my Radeon?
Rick Brewster
Posted May 22, 2010

Dunno. You tell us!
Rick Brewster
Posted May 22, 2010

Quote:
    Will this get implemented into v4.0 please?

Probably not for v4.0, but it is something I'm eyeing with hope and drool. The biggest hurdles right now are the sheer size of the interop code required for working with Direct3D 11, and the performance of software fallback.

The interop library can be partly code generated at least, but it's still enormous. It simply won't fit into the schedule for v4.0. For reference, the interop code in Paint.NET v4.0 for working with Direct2D, DirectWrite, and a subset of WIC, is currently 30,000 lines of code. All of Paint.NET v3.5 is about 200,000 lines of code.

As for software fallback, I don't have the resources to maintain two implementations for each effect (software/C# and hardware/HLSL). I have not been able to benchmark the DirectCompute reference software driver to see how it performs vs. a software/C# implementation of an effect. If it could run the above plugin at, say, 20% slower than the C# version then that'd probably be fine. But what if it were 1/100th the speed? Ouch. At least this plugin here gives me the opportunity to do that testing!
Bruce Bowyer-Smyth (Author)
Posted May 23, 2010

Thanks for all the feedback. I have started with Pyrochild's suggestion, as I had missed all those conversion methods on the ColorBgra struct and the image copy is important to the overall technique. Based on this I found that the image is already in memory in a form that I can pass directly to the shader without any conversions. So the C# packing code is gone and I am copying a whole row at a time to the buffer. The HLSL has been updated to match the format in which ColorBgra packs its data. Couldn't do exactly the same when reading the data back, unless someone knows how to convert a ColorBgra* to a ColorBgra[].

Improvements all round, with the main one being the 4800 x 6400 image getting an additional 1,500% performance boost. You can download it again to get the updated effect dll and source.

Simon: I fixed one divide by zero error that was producing something like that. See if the latest version fixes it for you.

In terms of deploying support for GPU effects, SlimDX supports a custom build scenario where you can strip out what you don't want and deploy your own assembly with your app. Of course that means adding a new dependency, which is not to be taken lightly when you are deploying desktop apps.

The reference driver performance is woeful at best, though it is not really designed to be used in production. What we really need is for Microsoft (or whoever) to release something like WARP for DirectCompute. A 20% drop compared to the CPU version would be fine for me, as it really is just a fallback. Although you know you are at a tipping point when even IE9 will be GPU accelerated.

If you really want to test the reference driver, add the following to the appSettings section of the config file in the new version:

    <add key="UseReference" value="1" />
Rick Brewster
Posted May 23, 2010

Quote:
    Couldn't do exactly the same reading the data back unless someone knows how to convert a ColorBgra* to a ColorBgra[].

Where are you seeing ColorBgra[]? Just use methods like GetRowAddress() and blt the data directly, just like you did in the other direction.
Simon Brown
Posted May 23, 2010

Quote:
    Simon: I fixed one divide by zero error that was producing something like that. See if the latest version fixes it for you.

It doesn't. More detailed screenshots:
Bruce Bowyer-Smyth (Author)
Posted May 23, 2010

Simon: Yes, it is happening with images whose width is not cleanly divisible by 10. This is due to how I am breaking down the image for threading. Will be fixed in the next release I put out.
Simon Brown
Posted May 23, 2010

Quote:
    Simon: Yes it is happening with image with a width not cleanly divisible by 10.

Yes, if I change the width of the images it works. Thanks.
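The posts above do not say exactly what went wrong with widths that are not a multiple of 10, so the following is only a guess at the general shape of the fix. It assumes each thread group covers 10 pixels across, which matches the "divisible by 10" remark but is not confirmed by the source; the helper names are made up. The usual pattern is to round the group count up and have each thread skip coordinates that fall outside the image.

```csharp
using System;

static class DispatchMath
{
    // Groups needed to cover `pixels` when each group handles `groupSize`
    // pixels; rounds up so a partial last group is still dispatched.
    public static int GroupCount(int pixels, int groupSize)
    {
        return (pixels + groupSize - 1) / groupSize;
    }

    // Guard each thread would apply before reading or writing column x.
    public static bool InBounds(int x, int width)
    {
        return x >= 0 && x < width;
    }

    static void Main()
    {
        Console.WriteLine(960 / 10);             // 96: exact fit
        Console.WriteLine(965 / 10);             // 96: last 5 columns not covered
        Console.WriteLine(GroupCount(965, 10));  // 97: covers every column
        Console.WriteLine(InBounds(969, 965));   // False: thread past the edge
    }
}
```

Rounding up without the bounds check trades missing columns for writes past the end of the row, so the two go together.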
Bruce Bowyer-Smyth (Author)
Posted May 23, 2010

Quote:
    Where are you seeing ColorBgra[] ?

That is what the result buffer gives me to work with. There are two versions of reading a range: "T[] ReadRange<T>(int count)" and "int ReadRange<T>(T[] buffer, int offset, int count)". I was seeing if I could call the second overload to get the buffer to update the image row directly, as it potentially had the least overhead, but I can't see a way to achieve this.

Given that ColorBgra[] is my starting point, what is the most efficient way to update the destination image, bearing in mind that only part of a row may need updating due to the selection rectangle? Is it still your previous suggestion?
Bruce Bowyer-Smyth (Author)
Posted May 23, 2010

Forgot to mention another lesson learned: the fxc.exe compiler can only parse hlsl files saved as ASCII. If you try to pass it a Unicode shader text file, which is what Visual Studio creates by default, the compilation will error out with the informative message "error X3501: 'CSMain': entrypoint not found", where 'CSMain' is the name of your main function.
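One way to sidestep the encoding problem described above is to re-save the shader as ASCII before compiling, either by hand in the editor or with a few lines of code. The helper below is a sketch using only standard .NET file APIs; the class and method names are made up, and any non-ASCII characters in the file (for example in comments) would be replaced with '?'.

```csharp
using System;
using System.IO;
using System.Text;

static class ShaderEncoding
{
    // Re-save a text file as ASCII. File.ReadAllText detects UTF-8 and
    // UTF-16 byte order marks, so the original encoding need not be known.
    public static void SaveAsAscii(string path)
    {
        string text = File.ReadAllText(path);
        File.WriteAllText(path, text, Encoding.ASCII);
    }

    static void Main()
    {
        string path = Path.GetTempFileName();
        // Write a UTF-16 file with a byte order mark to stand in for the shader.
        File.WriteAllText(path, "void CSMain() { }", Encoding.Unicode);
        SaveAsAscii(path);
        Console.WriteLine(File.ReadAllBytes(path).Length); // 17: one byte per character, no BOM
        File.Delete(path);
    }
}
```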
Rick Brewster
Posted May 24, 2010

Quote:
    Given that ColorBgra[] is my starting point what is the most efficient way to update the destination image?

Probably something like this.

    ColorBgra[] srcPixels = ... however you're getting this right now ... ;
    ColorBgra *pDstPixels = ... surface.GetPointAddress(roi.Left, roi.Top);
    fixed (ColorBgra *pSrcPixels = srcPixels)
    {
        IntPtr pDstPixels2 = (IntPtr)pDstPixels;
        IntPtr pSrcPixels2 = (IntPtr)pSrcPixels;
        ... now use either System.Runtime.InteropServices.Marshal.Copy() (which probably just calls into memcpy) ...
        ... or PaintDotNet.SystemLayer.Memory.Copy() (which definitely calls into memcpy which is optimized for SSE2 etc.) ...
    }

Don't use anything in the Memory class other than Copy and SetToZero, however.

So, you can kick off all the rendering in OnSetRenderInfo(). Then, in OnRender() you wait for that specific region to finish rendering, then blit it to the buffer that DstArgs references.
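To show the row-at-a-time copy without depending on Paint.NET types, here is a self-contained sketch. It uses an int[] as a stand-in for the ColorBgra[] read back from the GPU and a block of unmanaged memory as a stand-in for the destination surface; CopyRect and its parameters are hypothetical. It copies a rectangle that is narrower than the destination, which is the partial-row case raised earlier in the thread.

```csharp
using System;
using System.Runtime.InteropServices;

static class RowCopy
{
    // Copy a tightly packed width x height block of 32-bit pixels into a
    // larger unmanaged image, one row at a time. dstStride is in bytes.
    public static void CopyRect(int[] src, int width, int height,
                                IntPtr dstScan0, int dstStride, int left, int top)
    {
        for (int y = 0; y < height; y++)
        {
            IntPtr dstRow = IntPtr.Add(dstScan0, (top + y) * dstStride + left * 4);
            Marshal.Copy(src, y * width, dstRow, width);
        }
    }

    static void Main()
    {
        const int imageWidth = 8, imageHeight = 4, stride = imageWidth * 4;
        IntPtr image = Marshal.AllocHGlobal(stride * imageHeight);
        try
        {
            // Zero the stand-in surface.
            Marshal.Copy(new byte[stride * imageHeight], 0, image, stride * imageHeight);

            int[] block = { 1, 2, 3, 4, 5, 6 }; // 3 pixels wide, 2 rows high
            CopyRect(block, 3, 2, image, stride, 2, 1);

            Console.WriteLine(Marshal.ReadInt32(image, 1 * stride + 2 * 4)); // 1
            Console.WriteLine(Marshal.ReadInt32(image, 2 * stride + 4 * 4)); // 6
            Console.WriteLine(Marshal.ReadInt32(image, 1 * stride + 1 * 4)); // 0 (outside the rectangle)
        }
        finally
        {
            Marshal.FreeHGlobal(image);
        }
    }
}
```

One copy call per row keeps the pixels outside the selection rectangle untouched, at the cost of a call per row instead of one for the whole block.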
Bruce Bowyer-Smyth (Author)
Posted May 25, 2010

New version available. Includes the fix for the width overrun that Simon found and the performance improvements for the image update.

With the source image copy and destination image update code modified, the 4800 x 6400 test image is processed in 5617ms, which is now a total of 17,838% faster.

Onwards to investigate Rick's suggestion of batching before the OnRender call.
Sozo
Posted May 25, 2010

This does indeed work well on Radeon graphics cards.

When comparing between the processor and Radeon on a 5000x3750 image, the CPU took 4 minutes 58 seconds, while the GPU took 1759 milliseconds. That's a substantial improvement.

Hardware:
Phenom II 955 @ 3.2 GHz (stock)
Radeon 4850, also stock
Bruce Bowyer-Smyth (Author)
Posted June 5, 2010

OK, a bit of a tangent on the whole batching thing. DirectX has some multithreading built in through the use of device contexts. There is an immediate context and deferred contexts. Deferred contexts are designed for recording actions and resource creation, for example during a game's cut scene, that can later be played back on the immediate context. The immediate context is the only one that actually executes the work, and only one thing can be executing at a time. So that method is out for this situation.

CUDA has the ability to read the result data while the next dispatch is executing, but I couldn't find any reference to that functionality under DirectCompute. Even so, after checking the effect out under the Visual Studio profiler, nearly all of the time is spent on the GPU and the read back is a small portion of that, so there wouldn't be much additional gain.

So instead I have worked away on some effects in addition to the motion blur. In the belief that you haven't really created something generic until you have used it at least three times, here are the other two:

Gaussian Blur: Purely because a lot of other blurs are based on this one. The PDN standard effect is already pretty fast, but the GPU certainly pulls away on the larger images/radii. This one presented a few problems as I had to simulate a multiple (dual) pass effect using the standard single pass effect class. I have moved all of the workaround code into a base class so the individual effects are clean, but it is still not ideal as only the last pass will show a progress bar.

Channel Blur: A separate and unique Gaussian blur for each color channel (bgra), with the ability to control the radius for each one. Uses the same dual pass base class as the Gaussian.

Both have the ability to blur Horizontal and Vertical, Horizontal Only or Vertical Only. There is also an option to control edge behaviour.

In terms of the HLSL, I am now using compile macros. It does make the HLSL less clean, but it means that I can compile a couple of different optimal shaders from the same source and just load the best one at runtime. Very useful for the edge behaviour option, which introduces different execution paths, and when your inner loop executes 9 billion* times every operation counts.

Using DirectCompute might create different requirements than most, but I would like to put forward the following additions that I had to implement, or their equivalent, for consideration in the V4 effect remix (too late?). Some have already been mentioned elsewhere.

Events:
- OnBeginRender(): Most of the code that would go here currently has to go in the OnSetRenderInfo event, which looks out of place once you have a bit of it.
- OnBeginPass(): Takes a pass number and the source and destination args. I use this one for loading the correct shader for the particular pass (horizontal or vertical), and also to copy the source image to the buffer as it changes between passes.
- OnPassCompleted() and OnRenderCompleted(): Didn't implement these, but they would be handy for resource clean up and profiling.

Properties:
- MaxRegionWidth and MaxRegionHeight: To control region slice sizes for working within resource limits.
- Passes: Number of passes a multiple pass effect has. Would need to be changeable up until the render starts. For example, if only the vertical blur option was selected, only 1 pass would be needed.

Effect and source have been updated.

Cheers

* Motion blur at 25 degrees and 200 distance creates 301 sample points per pixel. 301 x 4800 x 6400 = 9,246,720,000, roughly 9 billion.
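For readers unfamiliar with why a Gaussian blur is run as two passes: the 2D Gaussian is separable, so a horizontal 1D pass followed by a vertical 1D pass gives the same result as one 2D pass, with 2 x (2r + 1) samples per pixel instead of (2r + 1)^2. The sketch below is a plain CPU illustration of that idea on a single-channel float image with clamped edges. It is not the plugin's HLSL or its base class, and all names in it are made up.

```csharp
using System;

static class SeparableBlur
{
    // Normalised 1D Gaussian weights for offsets -radius..radius.
    public static float[] Kernel(int radius, double sigma)
    {
        var k = new float[2 * radius + 1];
        double sum = 0;
        for (int i = -radius; i <= radius; i++)
        {
            double w = Math.Exp(-(i * i) / (2 * sigma * sigma));
            k[i + radius] = (float)w;
            sum += w;
        }
        for (int i = 0; i < k.Length; i++) k[i] = (float)(k[i] / sum);
        return k;
    }

    // One pass over a row-major single-channel image. horizontal = true
    // samples along x, false along y; out-of-range samples are clamped.
    public static float[] Pass(float[] src, int width, int height,
                               float[] kernel, bool horizontal)
    {
        int radius = kernel.Length / 2;
        var dst = new float[src.Length];
        for (int y = 0; y < height; y++)
        {
            for (int x = 0; x < width; x++)
            {
                float acc = 0;
                for (int i = -radius; i <= radius; i++)
                {
                    int sx = horizontal ? Math.Min(Math.Max(x + i, 0), width - 1) : x;
                    int sy = horizontal ? y : Math.Min(Math.Max(y + i, 0), height - 1);
                    acc += src[sy * width + sx] * kernel[i + radius];
                }
                dst[y * width + x] = acc;
            }
        }
        return dst;
    }

    static void Main()
    {
        int w = 5, h = 5;
        var image = new float[w * h];
        image[2 * w + 2] = 1f; // single bright pixel in the centre
        float[] kernel = Kernel(1, 1.0);
        // Pass 1 (horizontal) feeds pass 2 (vertical).
        float[] blurred = Pass(Pass(image, w, h, kernel, true), w, h, kernel, false);
        Console.WriteLine(blurred[2 * w + 2] > blurred[1 * w + 1]); // True
    }
}
```

Running only one of the two passes corresponds to the "Horizontal Only" and "Vertical Only" options mentioned above.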
Frontcannon
Posted June 5, 2010

So Gaussian Blur and Channel Blur now have a GPU version?
Rick Brewster
Posted June 5, 2010

No, you read it wrong. It's actually Doom and Wolfenstein 3D that he ported. This has nothing to do with Paint.NET, move along ...