GPU Median Filter

_koh_ · January 30

Was thinking about GPU median since I saw this thread and I think I've found a usable solution.
Basically a Hi-Lo based algorithm, kinda like B-tree and such.

[D2DInputCount(1), D2DInputSimple(0), D2DInputDescription(0, D2D1Filter.MinMagMipPoint), AutoConstructor]
private readonly partial struct Render : ID2D1PixelShader {
    private readonly float radius, percent;
    private readonly float2 delta;

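    // One binary-search probe: per channel, +1 if the target percentile lies above c, -1 otherwise.
    private float4 HiLo(float4 c) {
        float4 n = 0;   // per-channel count of samples <= c
        float  m = 0;   // total number of samples taken
        float2 o = 0, p = 0, q = 0;
        q.Y = radius;
        p.Y = q.Y % delta.Y - q.Y;
        for (o.Y = p.Y; o.Y <= q.Y; o.Y += delta.Y) {
            // half-width of the circular window on this row
            q.X = Hlsl.Trunc(Hlsl.Sqrt(q.Y * q.Y - o.Y * o.Y));
            p.X = Hlsl.Abs(q.X - o.Y) % delta.X - q.X;
            for (o.X = p.X; o.X <= q.X; o.X += delta.X) {
                float4 s = D2D.SampleInputAtOffset(0, o);
                n += Hlsl.Step(s, c);   // 1 where c >= s
                m += 1;
            }
        }
        // The -0.5f keeps the argument off zero when percent is a whole number, so Sign() gives +1 or -1.
        return Hlsl.Sign(Hlsl.Max(m * percent, 1) - n * 100 - 0.5f);
    }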

    public float4 Execute() {
        float4 c = 0.5f;
        float  d = 0.5f;
        c += HiLo(c) * (d *= 0.5f);
        c += HiLo(c) * (d *= 0.5f);
        c += HiLo(c) * (d *= 0.5f);
        c += HiLo(c) * (d *= 0.5f);
        c += HiLo(c) * (d *= 0.5f);
        c += HiLo(c) * (d *= 0.5f);
        c += HiLo(c) * (d *= 0.5f);
        c += HiLo(c) * (d *= 0.5f);
        return c;
    }
}

On my RTX 3060 Laptop, performance of quarter sampling mode matches FIFO optimized CPU median at radius = 100, but some artifacts are visible in places.
On my IGPU, quarter sampling mode at radius = 50 runs ...okay. At least it doesn't take forever.
A bit difficult to judge which algorithm is usable because each GPU is too different in performance.

Full source code + dll. MedianFilterGPU.zip

Rick Brewster · January 30

I played around with this a bit and it's really interesting! It definitely has some performance problems, but I think that could be improved -- I believe each invocation of HiLo() could be put into its own node in the shader graph, which would help avoid bogging down the GPU scheduler (pre-emptive scheduling does not seem to be a thing) as it appears to need to fully execute a shader before it can task-switch out to something else. This "temporal separability" is a major advantage of the algorithm you've devised here. So the shader would take two inputs, the first being the source image and the second being the output of the previous shader. Then chain it 8 times.

Where did you come across this algorithm? Searching for "hi-lo algorithm" gives me a bunch of discussion about ... databases?

lynxster4 · January 30

@_koh_ I've been playing around with your plugin also. I think it's great. I'm getting some very nice 'watercolor' effects.

Then add Sharpen+, or Emboss/Relif+, or TGMagnitude and things really start to 'pop'. I think it should be in the 'Artistic' menu. Great job! 😊

_koh_ · January 31

8 hours ago, Rick Brewster said:

I believe each invocation of HiLo() could be put into its own node in the shader graph, which would help avoid bogging down the GPU scheduler (pre-emptive scheduling does not seem to be a thing) as it appears to need to fully execute a shader before it can task-switch out to something else.

So linked shader and shader function work differently? interesting.

I'm new to this so I was assuming everything gets inlined in the end.

9 hours ago, Rick Brewster said:

Where did you come across this algorithm? Searching for "hi-lo algorithm" gives me a bunch of discussion about ... databases?

This is binary search, and while I already had it in my toolbox, this is the first time I've used it this way. It's like doing a lot of computing to decide which path to take.
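For readers who want to see the binary search without the shader plumbing, here is a minimal CPU-side sketch in Python for a single channel and a single output pixel. The function names (`hilo_step`, `hilo_percentile`) are made up for illustration; only the arithmetic mirrors `HiLo()` and `Execute()` from the first post, and values are assumed to lie in [0, 1].

```python
def hilo_step(samples, c, percent=50):
    """Return +1 if the target percentile lies above c, otherwise -1."""
    n = sum(1 for s in samples if s <= c)   # same role as Hlsl.Step(s, c)
    m = len(samples)
    # Same expression as the shader; the -0.5 keeps it away from zero.
    return 1 if max(m * percent, 1) - n * 100 - 0.5 > 0 else -1


def hilo_percentile(samples, percent=50, iterations=8):
    """Binary-search the value range instead of sorting the samples."""
    c, d = 0.5, 0.5
    for _ in range(iterations):
        d *= 0.5                            # 0.25, 0.125, ... as in Execute()
        c += hilo_step(samples, c, percent) * d
    return c


if __name__ == "__main__":
    window = [0.1, 0.2, 0.7, 0.8, 0.9]
    print(hilo_percentile(window))                  # close to the median 0.7
    print(hilo_percentile(window, iterations=12))   # closer still
```

Each iteration halves the remaining uncertainty, so 8 iterations land within 1/512 of the target value, and every iteration costs one full pass over the window. That is why the cost grows with the number of iterations rather than with a sort.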

And yeah, I'm mostly a database guy actually, so maybe that's affecting how I explore ideas. haha

5 hours ago, lynxster4 said:

@_koh_ I've been playing around with your plugin also. I think it's great. I'm getting some very nice 'watercolor' effects.

Then add Sharpen+, or Emboss/Relif+, or TGMagnitude and things really start to 'pop'. I think it should be in the 'Artistic' menu. Great job! 😊

Thanks!

I just file things under 'Color' when they're not 'Photo', so I'll move it there.

Rick Brewster · January 31

55 minutes ago, _koh_ said:

So linked shader and shader function work differently? interesting.

With shader linking, they should be equivalent (there are restrictions on this). But that's mostly a performance thing and I'm not sure that's what you mean.

What I'm saying is that you can think of a shader's simple inputs as being equivalent to function parameters, e.g. float4 Execute(float4 simpleInput1, float4 simpleInput2). (You can't write the code like that, but like I said it's conceptually equivalent/isomorphic.)

The shader's Execute() method would only call HiLo() once. The value for d would just be plugged in as a shader const (a private readonly field). Input 0 would stay the same and would be complex (D2DInputComplex), but Input 1 would be hooked up to the previous instance's output and would be simple (D2DInputSimple).

So instead of:

input -> Shader(call HiLo 8 times) = output

You'd have:

                 SourceImage              
                     |
                     +---------------------------+-------------...-------------+
                     |                           |                             |
                     v                           v                             v
Flood(0.5f) -> Shader(call HiLo once)-> Shader(call HiLo once) ... -> Shader(call HiLo once) = output
                     d=0.25                    d=0.125                      d=...

Flood is used to provide the initial value for c.

This would calculate the same thing, and it might even be slightly slower, but it would either eliminate or greatly reduce the lag imposed on the rest of the system because the GPU can "take a break" between each HiLo() call. Shaders can't be pre-emptively paused/resumed like CPU threads, IIUC, they must run to completion and can lock up the GPU or the whole system.
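The chained graph can be modeled on the CPU as a sequence of whole-image passes. The following Python sketch uses a 1D row of single-channel values for brevity; `hilo_pass` and `render_chain` are hypothetical names, and the edge clamping is an assumption borrowed from the `BorderEffect` usage later in the thread.

```python
def hilo_pass(source, previous, ratio, radius, percent=50):
    """One graph node: reads the source row plus the previous node's output."""
    last = len(source) - 1
    output = []
    for x, c in enumerate(previous):
        # Clamp reads at the edges (assumed border behavior).
        window = [source[min(max(x + o, 0), last)]
                  for o in range(-radius, radius + 1)]
        n = sum(1 for s in window if s <= c)
        sign = 1 if max(len(window) * percent, 1) - n * 100 - 0.5 > 0 else -1
        output.append(c + ratio * sign)
    return output


def render_chain(source, radius, passes=8):
    """Chain the passes; the initial image stands in for Flood(0.5f)."""
    image = [0.5] * len(source)
    ratio = 0.5
    for _ in range(passes):
        ratio *= 0.5          # 0.25, 0.125, ... like d in the single-shader version
        image = hilo_pass(source, image, ratio, radius)
    return image


if __name__ == "__main__":
    row = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7]
    print(render_chain(row, radius=1))
```

The only state carried between passes is the running estimate image, which is what makes it possible to split the work into separate nodes.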

_koh_ · January 31

55 minutes ago, Rick Brewster said:

Not sure what you mean by that.

I was assuming

shader.SetInput(0, new EmptyEffect(DC))

then do

D2D.GetInput(0)

is exactly the same thing as having

float4 EmptyEffect() => 0

in my shader then do

EmptyEffect()

after the shader linking. Not only results but how they run.

What you are suggesting is intentionally use D2DInputComplex to prevent shader linking and split them up?

edit:

I only have a rough idea of how shader linking works, so my question is likely a bit off.

At least I understand two blur effects can't be linked.

Edited January 31 by _koh_

Rick Brewster · January 31

13 hours ago, _koh_ said:

What you are suggesting is intentionally use D2DInputComplex to prevent shader linking and split them up?

No -- shader linking can only link a simple shader to another simple shader. I was just linking to D2D's documentation as a side note.

Do you mind if I take this code and run with it? I might be able to turn it into a more fleshed out plugin, or even incorporate it into Paint.NET itself. I know @BoltBait has been asking me for a Median effect he can use in his plugins, and this might do the trick better than the median approximation algorithm in Median Sketch.

_koh_ · February 1

10 hours ago, Rick Brewster said:

Do you mind if I take this code and run with it?

Totally fine.

This is more like proof of concept and basically if I post any code anyone can do anything with it.

edit:

If you optimize this, please tell me how you did it.

I've tested this version, but I still have both inputs set to simple, so I believe those 8 shaders get merged into 1 in the end. And the shader function version runs 50%-ish faster than this one.

protected override IDeviceImage OnCreateOutput(PaintDotNet.Direct2D1.IDeviceContext DC) {
    var radius  = (int)Token.GetProperty(PropertyNames.Radius ).Value;
    var percent = (int)Token.GetProperty(PropertyNames.Percent).Value;
    var sample  = (int)Token.GetProperty(PropertyNames.Sample ).Value;

    var delta  = new Vector2[] {new(1, 1), new(2, 1), new(2, 2)}[sample];
    var mapper = D2D1TransformMapperFactory<Render>.Inflate(radius);
    var output = (IDeviceImage)new FloodEffect(DC, new(0.5f));
    for (var (ratio, i) = (0.5f, 0); i < 8; i++) {
        using var source = new BorderEffect(DC, Environment.SourceImage, BorderEdgeMode.Clamp);
        using var input  = output;
        output = Shader([source, input], new(ratio *= 0.5f, radius, percent, delta), [], mapper);
    }
    return output;
}

[D2DInputCount(2), D2DInputSimple(0), D2DInputSimple(1), D2DInputDescription(0, D2D1Filter.MinMagMipPoint), AutoConstructor]
private readonly partial struct Render : ID2D1PixelShader {
    private readonly float ratio, radius, percent;
    private readonly float2 delta;

    public float4 Execute() {
        float4 c = D2D.GetInput(1);
        float4 n = 0;
        float  m = 0;
        float2 o = 0, p = 0, q = 0;
        q.Y = radius;
        p.Y = q.Y % delta.Y - q.Y;
        for (o.Y = p.Y; o.Y <= q.Y; o.Y += delta.Y) {
            q.X = Hlsl.Trunc(Hlsl.Sqrt(q.Y * q.Y - o.Y * o.Y));
            p.X = Hlsl.Abs(q.X - o.Y) % delta.X - q.X;
            for (o.X = p.X; o.X <= q.X; o.X += delta.X) {
                float4 s = D2D.SampleInputAtOffset(0, o);
                n += Hlsl.Step(s, c);
                m += 1;
            }
        }
        return c + ratio * (float4)Hlsl.Sign(Hlsl.Max(m * percent, 1) - n * 100 - 0.5f);
    }
}

Edited February 1 by _koh_

Rick Brewster · February 1

On 1/30/2024 at 12:38 PM, lynxster4 said:

I've been playing around with your plugin also. I think it's great. I'm getting some very nice 'watercolor' effects.

Then add Sharpen+, or Emboss/Relif+, or TGMagnitude and things really start to 'pop'. I think it should be in the 'Artistic' menu. Great job! 😊

Did you know about Effects -> Noise -> Median? That's what this new plugin is replicating, but running it on the GPU. The CPU version (the built-in Median) is actually faster, too -- as it turns out, doing a median calculation is very expensive to do on the GPU!

_koh_ · February 1

1 hour ago, Rick Brewster said:

Did you know about Effects -> Noise -> Median? That's what this new plugin is replicating, but running it on the GPU. The CPU version (the built-in Median) is actually faster, too -- as it turns out, doing a median calculation is very expensive to do on the GPU!

Yeah I know. That thing is ultra fast.
Actually I made a CPU version before the GPU version for reference, and I put some effort into optimizing it, but the built-in version still runs 20%-ish faster.
Seemingly the only way to make this O(n^2)->O(n) is FIFO optimization, which means we need to have a local buffer and process pixels sequentially. Not a good thing for a GPU.
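To illustrate why this kind of optimization is sequential, here is a Python sketch of a sliding-window histogram median for one image row. It assumes that "FIFO optimization" refers to something along these lines (pixels entering and leaving the window as it slides); the thread does not show the actual CPU code. It also uses a square window and 8-bit single-channel values for brevity, and `sliding_median_row` is a hypothetical name.

```python
def sliding_median_row(image, y, radius):
    """Median of a (2r+1) x (2r+1) window for every x in row y, edges clamped."""
    h, w = len(image), len(image[0])

    def px(xx, yy):
        return image[min(max(yy, 0), h - 1)][min(max(xx, 0), w - 1)]

    rows = range(y - radius, y + radius + 1)
    hist = [0] * 256
    # Build the histogram once for the window centered at x = 0.
    for yy in rows:
        for xx in range(-radius, radius + 1):
            hist[px(xx, yy)] += 1

    target = (2 * radius + 1) ** 2 // 2 + 1   # rank of the median
    out = []
    for x in range(w):
        if x > 0:
            # Slide right: one column leaves, one column enters (O(r) work).
            for yy in rows:
                hist[px(x - radius - 1, yy)] -= 1
                hist[px(x + radius, yy)] += 1
        acc = 0
        for value in range(256):
            acc += hist[value]
            if acc >= target:
                out.append(value)
                break
    return out
```

Each output pixel reuses the histogram left behind by its left neighbor, so the per-pixel update touches O(r) samples instead of O(r^2). That dependency on the previous pixel is what makes it awkward for a pixel shader, where every output pixel is computed independently.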

Rick Brewster · February 1

8 hours ago, _koh_ said:

... we need to have a local buffer and process pixels sequentially. Not a good thing for a GPU.

It might be doable with compute shaders, but PDN's D2D wrappers don't have support for that yet, nor does @sergiopedri's ComputeSharp.D2D1.

_koh_ · February 3

Tweaked the 1/4 sampling pattern and added 1/8 and 1/16 sampling.
While 1/4 looks nicer, it feels slightly slower. Likely due to being less cache friendly.
1/8 looks surprisingly OK. 1/16 is just bad.

One thing I'm aware of is low sampling rate looks bad when radius is low, but it's difficult to tell when radius is high.
I thought maybe I can make this adaptive, but again each GPU is too different in performance.

edit:

Added 1/2 jitter to 1/16 sampling and now it looks slightly nicer.

Seems like it's better to have some jitter for even / odd sampling line.

edit2:

Tweaked 1/4 sampling again. Looks as good and more cache friendly.
Gonna stop here for now😅

Source code + dll. MedianFilterGPU.zip

Edited February 3 by _koh_

_koh_ · February 3

One thing I've already tested and abandoned.
Add subpixel jitter and do linear sampling to make a low sampling rate look nicer.

I thought I may get visual boost for free because of hardware sampler and caching, but
- It didn't look that much nicer.
- It wasn't free.

Edited February 3 by _koh_

_koh_ · February 3

OK this one is effective.
Now I'm sampling a Pbgra32 buffer and this is like having 4x cache size.
Maybe we don't need 1/16 sampling anymore. MedianFilterGPU.zip

_koh_ · February 3

What will happen if I DrawImage() straight alpha data to Pbgra32? I'm doing this.

edit:

Result 100% pixel matches reference CPU version, so likely it's still straight. um

Edited February 3 by _koh_

_koh_ · February 4

Made sampling quality adaptive.
When radius < 8: quality +4, < 16: +3, < 32: +2, < 64: +1. So quality slider still has meaning.
With this, now quality = 2 on my IGPU is pretty tolerable in both quality and performance.
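The adaptive rule above can be written as a small lookup. This is a Python sketch with a hypothetical function name; whether the plugin caps the result at a maximum quality is not stated in the thread, so no cap is modeled here.

```python
def effective_quality(radius, quality):
    """Apply the radius-based bonus: <8: +4, <16: +3, <32: +2, <64: +1."""
    for limit, bonus in ((8, 4), (16, 3), (32, 2), (64, 1)):
        if radius < limit:
            return quality + bonus
    return quality   # radius >= 64: the slider value is used as-is


if __name__ == "__main__":
    for r in (4, 8, 20, 50, 100):
        print(r, effective_quality(r, 2))
```

Because the bonus is added to the slider value rather than replacing it, raising the slider still raises the quality at every radius.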

edit:

Maybe this is more readable.
MedianFilterGPU.zip

Edited February 4 by _koh_

_koh_ · February 4

Optimized filter is O(n) and this one is O(n^2), but now I'm making sampling rate 1/2 when radius is x2, so it's kinda O(n) in performance.
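The arithmetic behind that claim can be checked with a rough model: a circular window holds about pi * r^2 pixels, so halving the sampling rate each time the radius doubles makes the sample count grow linearly in r. The Python sketch below uses a hypothetical schedule (`adaptive_rate`, with full sampling up to an assumed radius of 8) purely for illustration.

```python
import math


def adaptive_rate(radius, full_rate_radius=8):
    """Hypothetical schedule: full rate up to full_rate_radius, halved per doubling."""
    rate, r = 1.0, full_rate_radius
    while r < radius:
        rate *= 0.5
        r *= 2
    return rate


def samples_per_pixel(radius, rate):
    """Approximate samples read per output pixel for a circular window."""
    return math.pi * radius * radius * rate


if __name__ == "__main__":
    for r in (8, 16, 32, 64):
        print(r, round(samples_per_pixel(r, 1.0)),
              round(samples_per_pixel(r, adaptive_rate(r))))
```

At a fixed rate, doubling the radius quadruples the samples; with the adaptive rate it only doubles them.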

Rick Brewster · February 4

On 2/3/2024 at 8:50 AM, _koh_ said:

What will happen if I DrawImage() straight alpha data to Pbgra32? I'm doing this.

If you're just drawing without any blending -- which means either 1) first drawing call after Clear(), or 2) using CompositingMode.SourceCopy, then it's basically just memcpy()

Rick Brewster · February 4

I've been using your original code as a means of experimenting/researching into compute shaders in the PDN v5.1 code base. It would be easy for me to add compute shader support for the next servicing release of 5.0, which would be 5.0.13, as it's just exposing the necessary interfaces and methods in the Direct2D wrappers. There's still no support in ComputeSharp.D2D1 for this.

So I've converted it over to a compute shader. It gets much trickier when doing this as you have to manage your own scheduling (numthreads and thread groups). I've implemented it such that each "thread" (one invocation of Execute()) writes an 8x4 block of pixels (pixel shader always writes out 1x1 per invocation). I use a resource texture to supply all of the sampling offsets, along with a bitmask indicating which pixel will use that sample. This lets me, for an 8x4 region anyway, only read each input pixel once instead of 8 times.
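One way to picture the offset-plus-bitmask table is the following Python sketch. It is an interpretation of the description above, not the actual compute shader: the table layout, the function names, and full (unstrided) sampling are all assumptions.

```python
def disc_offsets(radius):
    """Integer offsets inside a circular window of the given radius."""
    return [(dx, dy)
            for dy in range(-radius, radius + 1)
            for dx in range(-radius, radius + 1)
            if dx * dx + dy * dy <= radius * radius]


def build_block_table(radius, block_w=8, block_h=4):
    """Map each sample position (relative to the block origin) to a bitmask.

    Bit (py * block_w + px) is set when output pixel (px, py) of the block
    needs that sample, so each position is read once and shared.
    """
    table = {}
    for py in range(block_h):
        for px in range(block_w):
            bit = 1 << (py * block_w + px)
            for dx, dy in disc_offsets(radius):
                key = (px + dx, py + dy)
                table[key] = table.get(key, 0) | bit
    return table


if __name__ == "__main__":
    r = 3
    table = build_block_table(r)
    per_pixel_reads = 8 * 4 * len(disc_offsets(r))
    print("reads without sharing:", per_pixel_reads)
    print("reads with sharing:   ", len(table))
```

For each table entry, the thread would read the sample once and update the counters of every output pixel whose bit is set. The counting work per output pixel stays the same; only the number of texture reads drops.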

The performance speedup isn't dramatic: on a large 8192 x 4500 px image, at radius=100 and Full sampling, your original code takes ~13.5 seconds to render, while mine takes ~8.5 seconds. When I bump it up to 12 iterations of HiLo() -- which is necessary to get the right amount of precision to avoid banding artifacts -- it runs in ~12.5 seconds. So, not really any performance gain but there is a really good quality gain. Oh, and this was on a GeForce RTX 4090!

_koh_ · February 5

9 hours ago, Rick Brewster said:

If you're just drawing without any blending -- which means either 1) first drawing call after Clear(), or 2) using CompositingMode.SourceCopy, then it's basically just memcpy()

Thanks!

Now I have lots of convenient built-in features which is nice, but testing them to know if they do what I want to do is a bit time consuming.

9 hours ago, Rick Brewster said:

So I've converted it over to a compute shader. It gets much trickier when doing this as you have to manage your own scheduling (numthreads and thread groups). I've implemented it such that each "thread" (one invocation of Execute()) writes an 8x4 block of pixels (pixel shader always writes out 1x1 per invocation).

Interesting.

I know a bit of the basics through WebGPU and was wondering which is better: doing a thread-per-tile thing with some optimization in it, or keeping it as parallel as possible and just brute-forcing it. Looks like you get some gain if you do it properly.

_koh_ · February 15

Added smoothing. Applies a blur kernel the size of the sampling pattern to reduce artifacts.

MedianFilterGPU.zip

When using 1/2, 1/4, 1/8 sampling, we are seeing the average of 2, 4, 8 median colors so manually averaging them does no harm and it worked surprisingly well. I think now quality = 2 is good enough for many.

Technically this is mean of medians and I can't explain why this looks closer to the true median than median of means or median of medians. We are processing images so it has its own bias I guess.

quality = 1, smoothing on/off, 200% zoom
[Two attached screenshots]

Edited February 15 by _koh_

_koh_ · February 15

Now I'm using PrecisionEffect instead of CDC.Bitmap which gives me the same result at the same performance but I don't know what it's actually doing. Does it create intermediate buffer?

Rick Brewster · February 15

PrecisionEffect is a pass-through effect that uses a pixel shader to read the input image. This ensures Direct2D can't optimize it away. So yes, it is essentially forcing an intermediate buffer so that the next effect in the chain will consume the source at the given precision.

Source -> Precision -> NextEffect

This contrasts with PassthroughEffect which is a proper "passthrough" effect -- it uses ID2D1TransformGraph::SetPassthroughGraph() so it essentially "washes away" at render time as if it didn't even exist in the first place. It's not really useful for an effect graph, but it does have uses in some niche cases for architectural purposes. DynamicImage (e.g. PdnDentsEffect) uses this so that it can hand you the PassthroughEffect which you can plug into an effect graph, but then it can change which image/effect is plugged into that PassthroughEffect. This means you don't have to keep retrieving the DynamicImage's "output" when you change its properties (DynamicImage is not actually an ID2D1Image/ID2D1Effect).

It's very beneficial to use PrecisionEffect instead of a CompatibleDeviceContext.Bitmap because 1) that lets Direct2D manage the rendering process and memory management, and 2) it permits Paint.NET to manage rendering with tiles along with progress reporting and cancellation support. Otherwise you're forcing everything to render during OnCreateOutput(), during which there is no progress reporting or cancellation support.

Rick Brewster · February 15

On 2/4/2024 at 11:05 PM, _koh_ said:

Looks like you get some gain if you do it properly.

This compute shader's performance advantage seems to be that it greatly reduces the number of texture sampling instructions. It does not reduce the computational requirements -- each output pixel still needs to do the same amount of work. But there's up to an 87.5% reduction in texture sampling instructions because a sample that is used to compute multiple output pixels is only retrieved once. It likely doesn't reduce VRAM bandwidth because the GPU would be using an internal cache (e.g. L2) anyway, but it will reduce the bandwidth pressure on that internal cache.

Rick Brewster · February 15

1 hour ago, _koh_ said:

Now I'm using PrecisionEffect instead of CDC.Bitmap which gives me the same result at the same performance but I don't know what it's actually doing. Does it create intermediate buffer?

Another thing to note is that Paint.NET always runs effects at the highest precision (32-bit float per component / 128-bits per pixel). The SourceImage is still stored on the GPU as 32-bit BGRA, but is then premultiplied and/or color converted using 128-bpp to ensure the best quality. By using PrecisionEffect you are manually reducing the precision, which as you've seen can improve performance. However, it will of course reduce precision and color accuracy.

IMO it's not worth it, unless you're using caching (set effect.Properties.Cached to true) and you set the precision to Float16. This (caching) is almost never necessary, however, and should only be used very carefully and sparingly.
