Posts posted by Bruce Bowyer-Smyth

  1. Yes, I would have thought so too, but it seems to be capped between about 2,500% and 3,500%. I've got some kind of bottleneck going on there, but that's a good start.

    Didn't bother with the quality setting and just used the equivalent of the high-quality "5" option in Paint.net's radial blur. The algorithms are not the same but the final image is pretty close.

    New version posted, includes the zoom blur as well.

  2. I would encourage you to figure out how to use SlimDX without having to install it, however.

    I believe that is possible by just having it in the effects folder, with a separate redist for x86 and x64. If that is the case I will probably create a custom build with just the DX11 parts and a new namespace to avoid conflicts.

    I haven't added support for Direct3D simply because it's enormous

    No doubt. However, only a small part of it would be relevant to image post-processing. Unlike D2D, where you would want most of it, an additive approach to D3D might work better, with support added on demand. At least it could set a baseline and avoid version conflicts between plugins. Maybe compute and pixel shaders and the supporting functions around those. Drawing of primitives would probably want to stay in the realm of 2D.

    Maybe that could be handled by the plugin manager, so that D3D support is only downloaded when a plugin needs it?

  3. Slow would be a generous term, I would say. It is promoted as being optimized for correctness rather than speed, but I don't know how you could use it as a reference with that performance. I guess that's why WARP was developed for the rendering side of things.

    This is fine for a specialised plugin like this one, but like you said, if you wanted to use it for core functionality you would probably need a C# fallback version. The viability of that would depend, I suppose, on how often those things needed to be modified and how much performance benefit you could get.

    That can be an impressive amount in some ideal situations. Playing around with the thread count and the MaximumRegionWidth now has the large image processing at 22,098% faster.

    Starting to look at packaging and was wondering about the V4 plugin manager. Will it handle prerequisites/dependencies? If not, will it be possible to run an MSI?

  4. OK, a bit of a tangent on the whole batching thing. DirectX has some multithreading built in through the use of device contexts: there is an immediate context and there are deferred contexts. Deferred contexts are designed for recording commands and resource creation (during a game's cutscene, for example) that can later be played back on the immediate context. The immediate context is the only one that actually executes the work, and only one piece of work can be executing at a time, so that method is out for this situation.

    CUDA has the ability to read the result data while the next dispatch is executing, but I couldn't find any reference to that functionality under DirectCompute. Even so, after checking the effect out under the Visual Studio profiler, nearly all of the time is spent on the GPU and the read-back is a small portion of that, so there wouldn't be much additional gain.

    So instead I have worked away on some additional effects to go with the motion blur. In the belief that you haven't really created something generic until you have used it at least three times, here are the other two:

    Gaussian Blur: Purely because a lot of other blurs are based on this one. The PDN standard effect is already pretty fast, but the GPU certainly pulls away on the larger images/radii. This one presented a few problems as I had to simulate a multiple (dual) pass effect using the standard single-pass effect class. I have moved all of the workaround code into a base class so the individual effects are clean, but it is still not ideal, as only the last pass will show a progress bar.
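
    For anyone wondering why a Gaussian blur wants two passes: the 2D kernel is separable, so you can blur every row with a 1D kernel and then blur every column of that result and get the same output at a fraction of the cost. A rough single-channel CPU sketch of the idea (illustration only, not the plugin code):

    using System;

    static class SeparableBlurSketch
    {
        // Two-pass (separable) Gaussian blur over a single-channel float image.
        // Edges are clamped, which mirrors one of the edge behaviour options.
        public static float[] Blur(float[] src, int width, int height, float sigma)
        {
            float[] kernel = BuildKernel(sigma);
            int r = kernel.Length / 2;
            float[] tmp = new float[src.Length];
            float[] dst = new float[src.Length];

            // Pass 1: horizontal.
            for (int y = 0; y < height; y++)
                for (int x = 0; x < width; x++)
                {
                    float sum = 0;
                    for (int k = -r; k <= r; k++)
                    {
                        int sx = Math.Min(width - 1, Math.Max(0, x + k)); // clamp edge
                        sum += src[y * width + sx] * kernel[k + r];
                    }
                    tmp[y * width + x] = sum;
                }

            // Pass 2: vertical, reading the output of pass 1.
            for (int y = 0; y < height; y++)
                for (int x = 0; x < width; x++)
                {
                    float sum = 0;
                    for (int k = -r; k <= r; k++)
                    {
                        int sy = Math.Min(height - 1, Math.Max(0, y + k)); // clamp edge
                        sum += tmp[sy * width + x] * kernel[k + r];
                    }
                    dst[y * width + x] = sum;
                }

            return dst;
        }

        // Normalised 1D Gaussian kernel with a radius of about 3 sigma.
        private static float[] BuildKernel(float sigma)
        {
            int r = (int)Math.Ceiling(sigma * 3);
            float[] k = new float[r * 2 + 1];
            float sum = 0;
            for (int i = -r; i <= r; i++)
                sum += k[i + r] = (float)Math.Exp(-(i * i) / (2.0 * sigma * sigma));
            for (int i = 0; i < k.Length; i++)
                k[i] /= sum;
            return k;
        }
    }

    The vertical pass has to read what the horizontal pass produced, which is exactly why the source buffer needs refreshing between passes in the effect.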

    Channel Blur: A separate and unique Gaussian blur for each color channel (BGRA), with the ability to control the radius for each one. Uses the same dual-pass base class as the Gaussian.

    Both have the ability to blur Horizontal and Vertical, Horizontal Only or Vertical Only. There is also an option to control edge behaviour.

    In terms of the HLSL, I am now using compile macros. It does make the HLSL less clean, but it means I can compile a couple of different optimal shaders from the same source and just load the best one at runtime. This is very useful for the edge behaviour option, which would otherwise introduce different execution paths, and when your inner loop executes 9 billion* times, every operation counts.
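
    The runtime side of that is nothing fancy; conceptually it is just a lookup from the selected options to one of the precompiled blobs embedded in the DLL. A sketch of the idea (the resource names here are made up for illustration):

    using System.Collections.Generic;
    using System.IO;
    using System.Reflection;

    static class ShaderVariantSketch
    {
        // One precompiled blob per edge behaviour, all built from the same .fx
        // source with different /D macros and embedded as resources.
        private static readonly Dictionary<string, string> Variants = new Dictionary<string, string>
        {
            { "Clamp",  "ComputeShaderEffects.Shaders.GaussianBlur_Clamp.fxo"  },
            { "Wrap",   "ComputeShaderEffects.Shaders.GaussianBlur_Wrap.fxo"   },
            { "Mirror", "ComputeShaderEffects.Shaders.GaussianBlur_Mirror.fxo" },
        };

        // Returns the shader bytecode that matches the chosen edge behaviour.
        public static byte[] Load(string edgeBehaviour)
        {
            string resourceName = Variants[edgeBehaviour];
            using (Stream stream = Assembly.GetExecutingAssembly().GetManifestResourceStream(resourceName))
            using (BinaryReader reader = new BinaryReader(stream))
            {
                return reader.ReadBytes((int)stream.Length);
            }
        }
    }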

    Using DirectCompute might create different requirements than most, but I would like to put forward the following additions (or their equivalent), which I had to implement myself, for consideration in the V4 effect remix (too late?). Some have already been mentioned elsewhere. A rough sketch of the shape follows the Properties list.

    Events:

    OnBeginRender(): Most of the code that would go here currently has to go in the OnSetRenderInfo event, which looks out of place once you have more than a little of it.

    OnBeginPass(): Takes a pass number and the source and destination args. I use this one for loading the correct shader for the particular pass (horizontal or vertical), and also to copy the source image to the buffer, as it changes between passes.

    OnPassCompleted() and OnRenderCompleted(): Didn’t implement these but they would be handy for resource clean up and profiling.

    Properties:

    MaxRegionWidth and MaxRegionHeight: To control region slice sizes for working within resource limits.

    Passes: Number of passes a multiple-pass effect has. It would need to be changeable up until the render starts; for example, if only the vertical blur option was selected, only one pass would be needed.
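
    As a rough sketch, the shape I have in mind is something like this (hypothetical only, none of these members exist in the current SDK):

    using PaintDotNet;

    // Hypothetical base class showing the proposed hooks - not real Paint.NET API.
    public abstract class MultiPassEffectSketch
    {
        // Proposed properties
        public int MaxRegionWidth { get; set; }     // region slice width limit
        public int MaxRegionHeight { get; set; }    // region slice height limit
        public int Passes { get; set; }             // changeable until rendering starts

        // Proposed events, expressed here as overridable methods
        protected virtual void OnBeginRender() { }
        protected virtual void OnBeginPass(int pass, RenderArgs dstArgs, RenderArgs srcArgs) { }
        protected virtual void OnPassCompleted(int pass) { }
        protected virtual void OnRenderCompleted() { }
    }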

    Effect and source have been updated.

    Cheers

    *Motion blur at 25 degrees and 200 distance creates 301 sample points per pixel: 301 x 4800 x 6400 ≈ 9.2 billion.

  5. New version available. It includes a fix for the width overrun that Simon found and the performance improvements for the image update.

    With the source image copy and destination image update code modified, the 4800 x 6400 test image is processed in 5,617 ms, which is now a total of 17,838% faster.

    Onwards to investigate Rick's suggestion of batching before the OnRender call.

  6. Forgot to mention another lesson learned: the fxc.exe compiler can only parse HLSL files saved as ASCII. If you pass it a Unicode shader text file, which is what Visual Studio creates by default, the compilation will error out with this informative message:

    "error X3501: 'CSMain': entrypoint not found".

    Where 'CSMain' is the name of your main function.

  7. Where are you seeing ColorBgra[] ?

    That is what the result buffer gives me to work with. There are two overloads for reading a range: "T[] ReadRange<T>(int count)" and "int ReadRange<T>(T[] buffer, int offset, int count)". I was seeing if I could call the second overload to get the buffer to update the image row directly, as it potentially had the least overhead, but I can't see a way to achieve this.

    Given that ColorBgra[] is my starting point, what is the most efficient way to update the destination image, considering that only part of a row may need updating due to the selection rectangle? Is it still your previous suggestion?

  8. Thanks for all the feedback. I have started with Pyrochild's suggestion, as I had missed all those conversion methods on the ColorBgra struct, and the image copy is important to the overall technique. Based on this I found that the image is already in memory in a form that I can pass directly to the shader without any conversions. So the C# packing code is gone and I am copying a whole row at a time to the buffer. The HLSL has been updated to match the format in which ColorBgra packs its data.
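
    The row copy boils down to something like this (a simplified illustration, not the exact code from the zip, and it assumes the SDK's Surface.GetRowAddress helper): ColorBgra is four bytes with the same layout as one int, so a whole surface row can be blitted straight into the int[] that feeds the GPU buffer.

    using System;
    using System.Runtime.InteropServices;
    using PaintDotNet;

    static class RowCopySketch
    {
        // Copies one surface row into the staging array that gets written to the
        // DirectCompute buffer. No per-pixel conversion is needed.
        public static unsafe void CopyRow(Surface source, int y, int[] staging)
        {
            ColorBgra* row = source.GetRowAddress(y);
            Marshal.Copy((IntPtr)row, staging, 0, source.Width);
        }
    }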

    Couldn't do exactly the same when reading the data back, unless someone knows how to convert a ColorBgra* to a ColorBgra[].

    Improvements all round, with the main one being the 4800 x 6400 image, which gets an additional 1,500% performance boost.

    You can download it again to get the updated effect dll and source.

    Simon: I fixed one divide by zero error that was producing something like that. See if the latest version fixes it for you.

    In terms of deploying support for GPU effects, SlimDX supports a custom-build scenario where you can strip out what you don't want and deploy your own assembly with your app. Of course, that means adding a new dependency, which is not to be taken lightly when you are deploying desktop apps.

    The reference driver performance is woeful at best, though it is not really designed to be used in production. What we really need is for Microsoft (or whoever) to release something like WARP for DirectCompute. A 20% drop compared to the CPU version would be fine for me, as it really is just a fallback. Although you know you are at a tipping point when even IE9 will be GPU accelerated.

    Add the following to the appSettings section of the config file in the new version if you really want to test the reference driver.

    <add key="UseReference" value="1" />
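
    If you are not sure where that goes, the config file is just a standard .NET configuration file, so the trimmed-down shape is roughly:

    <configuration>
      <appSettings>
        <!-- Run the effect on the DirectX reference (software) device -->
        <add key="UseReference" value="1" />
      </appSettings>
    </configuration>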

  9. Dev Notes

    Due to the newness of DirectCompute there are very few resources on the web. Most of these are for Darth C++, as you would expect, with just a couple showing how to use it in .NET. So hopefully this example may help people out, although I have no prior experience with DirectX, so it may not follow "best practices" yet.

    Development Prerequisites

    Lessons Learned

    1. Having an external rendering framework doesn't fit exactly into the existing Paint.net effect model, so there are a couple of things that need to be done if you are not writing a CPU effect. The first is to set EffectFlag.SingleThreaded on the constructor, which essentially says "I want to manage threading myself (on the GPU) and not be CPU threaded". The second is to set up the framework and anything needed across render calls in the OnSetRenderInfo method (is there a better way to do this?).
    2. HLSL (High Level Shader Language) constant buffers must be sized in multiples of 16 bytes. You can either pack variables or add padding; see the Constants struct (and the sketch after this list). Constants can be used to pass your configurable effect parameters into the shader.
    3. Timeout Detection and Recovery (TDR) is a Windows Vista/7 feature that prevents a frozen device driver from locking up your system. If a display driver doesn't respond within 2 seconds it will be restarted, with a message like "Display driver has stopped responding and has successfully recovered". I initially thought the way PDN slices images into multiple render calls would be the Achilles heel of this solution, but it actually turned out to be its saviour, keeping each batch well below 2 seconds. Don't be surprised if you hit this problem before you get to make improvements to your code, though. TDR can be disabled through the registry, but it is not advisable to do so.
    4. SlimDX is a thin wrapper over DirectX 11 and many other Windows technologies. Just about every object it creates in this solution wraps an unmanaged one, so they all need to be tracked and disposed of in a timely manner.
    5. .NET types map pretty well onto HLSL, but HLSL has a limited type set; mainly floats and ints are used. There is no byte, so I was originally converting ColorBgra (a struct of 4 bytes) into a float4 (a struct of 4 floats), but the memory use was too large. I am now packing the 4 bytes into 1 int (see the sketch after this list). I didn't measure the speed beforehand, but it actually seems a little quicker, as there is a lot less information to copy and retrieve even with the pack/unpack overhead.
    6. Compute shader resources are all about buffers and views. Normally you create a buffer with data you want to pass and then create a view of that buffer.
    7. Debugging is difficult. As with most new tech, the tools to build come first and the tools to debug are refined later. Both AMD and NVIDIA are producing their own tools for this purpose. I have signed up for the NVIDIA Parallel Nsight beta, which is an add-in to Visual Studio. I've just downloaded it so haven't had a chance to use it yet, but it should be a lot better than what I was doing before, which was to set pixels to certain colors based on a condition I wanted to check.
    8. HLSL is compiled with fxc.exe, which comes with the DirectX SDK. See the compile.cmd file for the syntax. You can also compile at runtime from the HLSL file.
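
    To make points 2 and 5 a bit more concrete, here is roughly what the padding and the packing look like (a simplified illustration, not the structs from the source zip):

    using System.Runtime.InteropServices;

    // Lesson 2: the constant buffer must be a multiple of 16 bytes, so pad explicitly.
    [StructLayout(LayoutKind.Sequential)]
    struct BlurConstantsSketch
    {
        public int Radius;    // 4 bytes
        public int Width;     // 4 bytes
        public int Height;    // 4 bytes
        public int Padding;   // 4 bytes of padding -> 16 bytes total
    }

    // Lesson 5: HLSL has no byte type, so the four BGRA bytes travel as one int.
    static class BgraPackingSketch
    {
        public static int Pack(byte b, byte g, byte r, byte a)
        {
            return b | (g << 8) | (r << 16) | (a << 24);
        }

        public static void Unpack(int packed, out byte b, out byte g, out byte r, out byte a)
        {
            b = (byte)(packed & 0xFF);
            g = (byte)((packed >> 8) & 0xFF);
            r = (byte)((packed >> 16) & 0xFF);
            a = (byte)((packed >> 24) & 0xFF);
        }
    }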

    ComputeShaderEffectsSource.zip

    Feel free to use this code to create your own effect if you want. Just remember to set the build action of your fx file to "Embedded Resource".

    Interested to hear of any improvements or suggestions from those in the know.

  10. Edit: This effect has now been published here. If you just want to use it, get it from there.

    GPU-based effects and comparisons to CPU

    This started out as a bit of a personal research project but I wanted to share the code and get some opinions.

    There has been a lot of talk about using the GPU for general-purpose computation (GPGPU), so I wanted to see if Paint.net effects could benefit from this technique. Using DirectCompute and compute shaders, could they outperform a CPU by enough of a margin to make dealing with the extra dependencies worthwhile? Short answer: yes, yes they can.

    I ran a few tests with the standard Motion Blur effect on my middle-aged computer to confirm. The GPU version produces the same image (slight color variations due to rounding differences).

    Intel Core 2 Duo E6400

    VS

    NVIDIA 8800 GTS 320MB

    Test image: 960 x 1280 photo 72dpi

    Effect Settings: Motion Blur, Direction = 25.00, Centered = ticked

    Blur Distance    CPU (approx.)    GPU       Speed Increase (approx.)
    10               2,400 ms         311 ms    671%
    50               10,600 ms        348 ms    2,945%
    100              20,100 ms        398 ms    4,950%
    200              38,000 ms        498 ms    7,530%

    It is interesting that, even with the overhead of having to copy the entire image over to the video card, even the smallest computation is notably faster.

    Let’s up the image size a bit.

    Test image: Resized 400%, 3840 x 5120

    Blur Distance    CPU (approx.)    GPU         Speed Increase (approx.)
    200              10 min 42.7 s    5,567 ms    11,444%

    More?

    Test image: Resized 500%, 4800 x 6400

    Blur Distance    CPU (approx.)    GPU         Speed Increase (approx.)
    200              16 min 47.6 s    8,281 ms    12,067%

    Well, that's pretty impressive; it seems the GPU loves the large data sets. Obviously this is a relative comparison, and if I had a quad core the difference would be about half, but even that is pretty good. As I am a gamer, I aimed for the dual core and the 8800 to be fairly balanced when I bought them, so that one wouldn't be a bottleneck for the other. So I think this is a fair comparison.

    Anyway, enough talk. Time for you to try it. For now this is a manual install. Here is what you need:

    Prerequisites

    • Windows 7 or Windows Vista with the DirectX 11 platform update (x86, x64).
    • SlimDX Runtime (February 2010)
    • Latest video drivers. DirectCompute support hasn't been around for long, so you will need to update your video drivers to get it. Download GPU-Z and confirm that the DirectCompute checkbox is ticked. If it isn't, you either have an unsupported video card or don't have the latest drivers. The GPU effect will fall back to the reference driver (software), which is incredibly slow, if an unsupported device is found.

    I haven't got an AMD/ATI card to try it out, so I would be interested to hear if all is well on those cards, plus how a newer NVIDIA card performs.

    Extract this zip file into the Paint.net effects folder.

    ComputeShaderEffects.zip

    If you want to see the render time, drop this config file into the effects folder along with the other DLL. When processing a full-image selection it will show a message box when the render is complete. Remove the config file when you are done.

    ComputeShaderEffectsConfig.zip

    Known issues: Currently getting an Out of Memory exception well before using up the available video card memory. I haven't investigated this one yet. I know I need enough memory for the image and the output buffer, but it fails well short of that.

  11. WMF (Windows Metafile) file import plugin.

    Download WMF Paint.NET Plugin

    This plugin interprets the WMF record structure using 100% .NET code and uses WPF as the rendering engine. The main code behind this is the WMF2WPF library and with this plugin Paint.NET now has the ability to open metafiles.

    Note that this does not turn Paint.NET into a vector editor. WMF files will open just as they do in MS Paint, as a raster image, but will typically look nicer with this plugin due to anti-aliasing. Images must be saved in a different file format, PNG for example. I have made it so that a metafile opens with two layers: a white background and a second layer with the main image. Hide the background layer to use the image with transparency (if the image has it).

    Cheers
