Everything posted by _koh_

  1. This MS histogram is a bit cumbersome to use, so that's good to hear. In the latest version I'm mapping the input histogram to the output histogram like in my CPU version, and I need at least 4096 bins to do that, but the MS histogram only supports up to 1024 bins, so I'm scanning the image 4 times and back-calculating it. If the new histogram supports a higher bin count, that's even better.

     private Vector4[] Histogram(IDeviceImage image, int prec)
     {
         using var idc = DC.CreateCompatibleDeviceContext(null, new(1024, 1024), DevicePixelFormats.Prgba128Float);
         using var odc = DC.CreateCompatibleDeviceContext(null, new(), DevicePixelFormats.Prgba128Float);
         var (bins, span) = (Math.Min(prec, 4) * 256, Math.Max(prec, 4) / 4);
         var data = new Vector4[span * bins];
         var d = new RectInt32(0, 0, Environment.Document.Size);
         var t = new RectInt32(0, 0, idc.PixelSize);

         // Scan the document in 1024x1024 tiles.
         for (t.Offset(0, -t.Y); t.Y < d.Height; t.Offset(0, t.Height))
         for (t.Offset(-t.X, 0); t.X < d.Width; t.Offset(t.Width, 0))
         {
             var r = RectInt32.Intersect(d, t);
             using (idc.UseBeginDraw()) idc.DrawImage(image, null, r, compositeMode: CompositeMode.SourceCopy);

             // One histogram pass per channel and per sub-bin offset.
             foreach (var n in new[] {0, 1, 2})
             for (var j = 0; j < span; j++)
             {
                 var (l, o) = (255f / 256, 0.5f / 256 - (j - span + 1f) / (span * bins));
                 using var size = new CropEffect(DC, idc.Bitmap, new(0, 0, r.Size));
                 using var tran = new ColorMatrixEffect(DC, size, new(l,0,0,0, 0,l,0,0, 0,0,l,0, 0,0,0,1, o,o,o,0));
                 using var hist = new HistogramEffect(DC);
                 hist.Properties.Input.Set(tran);
                 hist.Properties.Bins.SetValue(bins);
                 hist.Properties.ChannelSelect.SetValue((ChannelSelector)n);
                 using (odc.UseBeginDraw()) odc.DrawImage(hist);

                 var (v, s) = (hist.Properties.HistogramOutput.GetValue(), (float)r.Area / d.Area);
                 for (var i = 0; i < bins; i++) data[span * i + j][n] += v[i] * s;
             }
         }

         // Back-calculate the high-resolution histogram from the shifted passes.
         var sum = (Span<Vector4> a) => {var x = Vector4.Zero; foreach (var v in a) x += v; return x;};
         for (var i = 0; i < data.Length; i++) data[i] -= sum(data.AsSpan(Math.Max(i - span + 1, 0)..i));
         return data;
     }

     Full source code: LightBalanceGPU.zip
  2. Posting the latest version with 2-bit binary search, or more like quarter search. Source code + dll: MedianFilterGPU.zip

     Additionally, I made versions which compute 2, 3, 4 pixel colors at once, then ran them on 1/2, 1/3, 1/4 sized images to estimate compute shader performance.

     8K image, radius 100, sampling rate 1/4, RTX 3060 laptop:
     No optimization - 18.2s
     INT8 sampling - 8.6s
     2-bit binary search - 10.2s
     Pseudo 2, 3, 4 pixel output - 10.2s, 8.2s, 7.8s
     INT8 sampling + 2-bit binary search - 7.2s
     INT8 sampling + pseudo 2, 3, 4 pixel output - 6.9s

     Looks like 2.6x the original version is the performance ceiling on my GPU, and this latest version is at 2.5x. The 2-bit binary search needs to test 3 thresholds inside the loop to halve the loop iterations; maybe that's why it runs slightly slower.

     2-bit binary search shader:

     private readonly partial struct Render : ID2D1PixelShader
     {
         private readonly float r, p;
         private readonly float3 d;

         private float4 HiLo(float4 c, float v)
         {
             // Test three thresholds (c - v, c, c + v) per sample.
             float3x4 n = 0;
             float m = 0;
             float y = r % d.Y - r;
             for (; y <= r; y += d.Y)
             {
                 float w = Hlsl.Trunc(Hlsl.Sqrt(r * r - y * y));
                 float x = (w + r * d.X + y / d.Y * d.Z) % d.X - w;
                 for (; x <= w; x += d.X)
                 {
                     float4 s = D2D.SampleInputAtOffset(0, new(x, y));
                     n += Hlsl.Step(new float3x4(s, s, s), new(c - v, c, c + v));
                     m += 1;
                 }
             }
             return (float3)1 * (1 - 2 * Hlsl.Step(Hlsl.Max(m * p, 1), n * 100));
         }

         public float4 Execute()
         {
             float4 c = 0.5f;
             float v = 0.5f;
             c += HiLo(c, v *= 0.5f) * (v *= 0.5f);
             c += HiLo(c, v *= 0.5f) * (v *= 0.5f);
             c += HiLo(c, v *= 0.5f) * (v *= 0.5f);
             c += HiLo(c, v *= 0.5f) * (v *= 0.5f);
             return c;
         }
     }

     Pseudo 4 pixel output shader:

     private readonly partial struct Render : ID2D1PixelShader
     {
         private readonly float r, p;
         private readonly float3 d;

         private float4x4 HiLo(float4x4 c, float2 o)
         {
             float4x4 n = 0;
             float m = 0;
             float y = r % d.Y - r;
             for (; y <= r; y += d.Y)
             {
                 float w = Hlsl.Trunc(Hlsl.Sqrt(r * r - y * y));
                 float x = (w + r * d.X + y / d.Y * d.Z) % d.X - w;
                 for (; x - d.X * 3 <= w; x += d.X)
                 {
                     // float4 s = input[(int2)(o + new float2(x, y))];
                     float4 s = D2D.SampleInputAtPosition(0, o + new float2(x, y));
                     float4 a = Hlsl.Step(Hlsl.Abs(x - d.X * new float4(0, 1, 2, 3)), w);
                     n += new float4x4(a.X, a.Y, a.Z, a.W) * Hlsl.Step(new(s, s, s, s), c);
                     m += a.X;
                 }
             }
             return 1 - 2 * Hlsl.Step(Hlsl.Max(m * p, 1), n * 100);
         }

         public float4 Execute()
         {
             // float2 o = new(ThreadIds.X * 4 - ThreadIds.X % d.X * 3, ThreadIds.Y);
             float2 o = D2D.GetScenePosition().XY;
             float4x4 c = 0.5f;
             float v = 0.5f;
             c += HiLo(c, o) * (v *= 0.5f);
             c += HiLo(c, o) * (v *= 0.5f);
             c += HiLo(c, o) * (v *= 0.5f);
             c += HiLo(c, o) * (v *= 0.5f);
             c += HiLo(c, o) * (v *= 0.5f);
             c += HiLo(c, o) * (v *= 0.5f);
             c += HiLo(c, o) * (v *= 0.5f);
             c += HiLo(c, o) * (v *= 0.5f);
             // output[(int2)(o + new float2(d.X * 0, 0))] = c[0];
             // output[(int2)(o + new float2(d.X * 1, 0))] = c[1];
             // output[(int2)(o + new float2(d.X * 2, 0))] = c[2];
             // output[(int2)(o + new float2(d.X * 3, 0))] = c[3];
             return (float4)1 / 4 * c;
         }
     }
  3. My last post sounds like a mess even for my English, haha. I was trying to say: it looks like the original shader is idling 75% of the time because of bandwidth or latency, so doing 4x the computing per sample to cut the loop iterations to 1/4 might be the sweet spot. I expect making any part of the for(y) for(x) for(i) loop 1/4 to have the same effect, but a 1/4 i loop requires 8x computing per sample, and a pixel shader can't configure the xy loop. I tested the 1/2 i loop configuration and got a 15% boost from it. My latest version is doing INT8 sampling, so apparently there isn't much room left. edit: I found the performance ceiling on my GPU is closer to 3x than 4x, so it's likely GPU clock dependent: while a 2GHz GPU is idling 75% of the time, a 1.5GHz GPU is only idling 66% of the time, and so on.
  4. Seems like you can at least get 4x compute per fetch compared to the original shader for free. What if you change to arity = 2 and keep the 4-pixel output? That's another 4x compute-per-fetch setup, I believe.
  5. That's neat. I never thought about pre-computing a better pivot and starting from it. I haven't gone through the code, so I might be wrong, but basically: when the min-max range of the samples is < 1/2 we only need 7 tests instead of 8, when it's < 1/4 we only need 6, and so on. And we don't need the exact min-max to narrow down the range, so you are doing square sampling instead of circle sampling to get the V->H optimization. Something like that, I guess. Yeah, we can make sampling 1/n with 2^(n-1) registers. I considered it but never tested it. 2x the registers means 1/2 the threads the GPU can fly, so 1/2, 1/3, 1/4 sampling comes with 1/2, 1/4, 1/8 threads. Going beyond n=2 is unlikely to be worth it, but it's very possible n=2 is better than n=1 in general. I'm just curious, but is making it multi-pass better than using more tiles? PDN is already doing tiled rendering. Yeah, while linear color space gives us more accurate results in general, doing histograms, dithering etc. in the original color space makes more sense.
  6. This is basically a histogram without the array. Instead of keeping bins[256] and computing a tally to test which bin holds the median, it recomputes the tally every time and uses a binary search to decide which bin holds the median. And a histogram is one thing we should compute in the storage format, which could be sRGB or anything. Maybe this is a better explanation.
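     To illustrate the idea, here is a minimal CPU sketch of that bisection (my own example, not code from the plugin): each step re-counts how many samples sit at or below the candidate value instead of reading a bin array, and 8 steps pin the median down to one of 256 levels, mirroring the shader's HiLo() rounds.

     static float MedianByBisection(ReadOnlySpan<float> samples)
     {
         float c = 0.5f, d = 0.5f;
         for (var step = 0; step < 8; step++)
         {
             // The "tally" of a histogram, recomputed on the fly.
             var atOrBelow = 0;
             foreach (var s in samples)
                 if (s <= c) atOrBelow++;

             // Bisect: move up when fewer than half the samples are <= c.
             d *= 0.5f;
             c += atOrBelow * 2 <= samples.Length ? d : -d;
         }
         return c;
     }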
  7. Ah, maybe that's good enough. I can already choose from FP32/FP16/INT8, and FP32 is likely a bit overkill for the nature of this effect. Correct. This shader can only pick one value from a predetermined set of values, so when we run it in the storage format's color space, we get the best result with the fewest steps. Actually, if we used linear value thresholds we could finish in 8 steps in linear color space too, but we would need to convert the color space 8 times inside the shader, so converting the buffer to the storage format beforehand is the better way to do this.
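     For illustration (my own sketch, not from the post above): the 8-step search can only land on 256 evenly spaced values, which is why matching the buffer to an 8-bit storage format loses nothing.

     using System;
     using System.Linq;

     // Each step adds ±d with d halving from 1/4 down to 1/512, so the result is
     // 0.5 ± 1/4 ± 1/8 ± ... ± 1/512 = (2m + 1) / 512 for m = 0..255.
     var values = new[] { 0.5f }.AsEnumerable();
     for (var d = 0.25f; d >= 1f / 512; d /= 2)
     {
         var step = d;
         values = values.SelectMany(c => new[] { c - step, c + step }).ToArray();
     }
     Console.WriteLine(values.Count());  // 256 distinct levels
     Console.WriteLine(values.Min());    // 0.001953125 (= 1/512)
     Console.WriteLine(values.Max());    // 0.998046875 (= 511/512)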
  8. When you said you were calling HiLo() 12 times to avoid banding, I thought that was weird because you would get the same results anyway, but now it makes more sense. In that case, I want to match the intermediate buffer precision to the original buffer precision, so I hope there will be a way to do that. Or can I do that already?
  9. Thanks for this. Seems like I should generally avoid manually creating a buffer unless rendering it is taxing and I can reuse it in the following tokens. In this case, it at least makes changing sliders and such more performant. Yeah, a 32KB L1 can hold 128x128 INT8 pixels, so when the radius is relatively low, like < 32, the GPU is likely reading from VRAM once and from L1 the rest of the time. This particular shader only takes straight-alpha sRGB and outputs straight-alpha sRGB, so I get 100% pixel-matching results regardless of whether I use FP32 or INT8. Now I'm applying a blur filter to the output, and I wanted to do that in premultiplied linear, so I'm bringing it back to FP32 before that.

     using var input  = new PrecisionEffect(DC, source, BufferPrecision.UInt8Normalized);
     using var expand = new BorderEffect(DC, input, BorderEdgeMode.Clamp);
     using var render = Shader([expand], new(radius, percent, delta), [], D2D1TransformMapperFactory<Render>.Inflate(radius));
     using var shrink = new CropEffect(DC, render, new(0, 0, Environment.Document.Size));
     using var output = new PrecisionEffect(DC, shrink, BufferPrecision.Float32);
  10. Now I'm using PrecisionEffect instead of CDC.Bitmap, which gives me the same result at the same performance, but I don't know what it's actually doing. Does it create an intermediate buffer?
  11. Added smoothing. It applies a blur kernel the size of the sampling pattern to reduce the artifact. MedianFilterGPU.zip When using 1/2, 1/4, 1/8 sampling, we are already seeing the average of 2, 4, 8 median colors, so manually averaging them does no harm, and it worked surprisingly well. I think quality = 2 is now good enough for many uses. Technically this is a mean of medians, and I can't explain why it looks closer to the true median than a median of means or a median of medians. We are processing images, so the data has its own bias, I guess. quality = 1, smoothing on/off, 200% zoom
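     In effect-chain terms, the smoothing pass could look something like this (a sketch only; it assumes PDN exposes Direct2D's ConvolveMatrix effect as ConvolveMatrixEffect with the stock D2D property names, and shows a 2x2 box kernel matching a 1/2 sampling pattern):

     // Hypothetical smoothing pass appended after the median shader's output.
     using var smooth = new ConvolveMatrixEffect(DC);
     smooth.Properties.Input.Set(render);
     smooth.Properties.KernelSizeX.SetValue(2);
     smooth.Properties.KernelSizeY.SetValue(2);
     smooth.Properties.KernelMatrix.SetValue(new float[] { 1, 1, 1, 1 });
     smooth.Properties.Divisor.SetValue(4);  // average of the 4 taps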
  12. Thanks! Now I have lots of convenient built-in features, which is nice, but testing them to see whether they do what I want is a bit time consuming. Interesting. I know a bit of the basics through WebGPU, and I'm wondering which is better: doing the thread-per-tile thing and optimizing within it, or keeping it as parallel as possible and just brute-forcing. Looks like you get some gain if you do it properly.
  13. The optimized filter is O(n) and this one is O(n^2), but now I'm halving the sampling rate whenever the radius doubles, so it's kinda O(n) in performance.
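     As a quick sanity check on that claim (my own numbers; I'm assuming the sample spacing simply doubles with the radius): samples per pixel ≈ πr²/spacing², so spacing ∝ r keeps the per-pixel work flat.

     using System;

     // Per-pixel sample counts when the sampling spacing doubles with the radius.
     for (int radius = 25, spacing = 1; radius <= 400; radius *= 2, spacing *= 2)
     {
         var samples = Math.PI * radius * radius / (spacing * spacing);
         Console.WriteLine($"radius {radius,3}, spacing {spacing}: ~{samples:F0} samples/pixel");
     }
     // Prints ~1963 samples/pixel at every radius: constant per-pixel cost,
     // so total cost grows only with the pixel count.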
  14. Made the sampling quality adaptive. When radius < 8: quality +4, < 16: +3, < 32: +2, < 64: +1. So the quality slider still has meaning. With this, quality = 2 on my IGPU is now pretty tolerable in both quality and performance. edit: Maybe this is more readable. MedianFilterGPU.zip
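     Expressed in code, the rule could look like this (a sketch of the mapping as described, not the shipped source; AdaptiveQuality is a hypothetical helper name):

     // Boost the user's quality setting at small radii, where low sampling
     // rates are the most visible; leave large radii alone.
     static int AdaptiveQuality(int quality, int radius) =>
         quality + radius switch
         {
             < 8  => 4,
             < 16 => 3,
             < 32 => 2,
             < 64 => 1,
             _    => 0,
         };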
  15. What will happen if I DrawImage() straight-alpha data to Pbgra32? I'm doing this. edit: The result 100% pixel-matches the reference CPU version, so it's likely still straight. um
  16. OK, this one is effective. Now I'm sampling a Pbgra32 buffer, and at 4 bytes per pixel instead of 16, it's like having 4x the cache size. Maybe we don't need 1/16 sampling anymore. MedianFilterGPU.zip
  17. One thing I already tested and abandoned: add subpixel jitter and do linear sampling to make low sampling rates look nicer. I thought I might get a visual boost for free thanks to the hardware sampler and caching, but:
     - It didn't look that much nicer.
     - It wasn't free.
  18. Tweaked the 1/4 sampling pattern and added 1/8 and 1/16 sampling. While 1/4 looks nicer, it feels slightly slower, likely due to being less cache friendly. 1/8 looks surprisingly OK. 1/16 is just bad. One thing I'm aware of: a low sampling rate looks bad when the radius is low, but it's difficult to tell when the radius is high. I thought maybe I could make this adaptive, but again, each GPU is too different in performance. edit: Added 1/2 jitter to the 1/16 sampling and now it looks slightly nicer. Seems like it's better to have some jitter between even and odd sampling lines. edit2: Tweaked the 1/4 sampling again. It looks as good and is more cache friendly. Gonna stop here for now 😅 Source code + dll: MedianFilterGPU.zip
  19. Yeah, I know. That thing is ultra fast. I actually made a CPU version before the GPU version for reference, and I put some effort into optimizing it, but the built-in version still runs 20%-ish faster. Seemingly the only way to take this from O(n^2) to O(n) is the FIFO optimization, which means we need a local buffer and have to process pixels sequentially. Not a good fit for a GPU.
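     For reference, here is the FIFO idea in its simplest 1-D form (my own sketch, not the plugin's CPU code): keep a running histogram of the window, and as the window slides, remove the sample that falls out and add the one that comes in, so each output pixel costs O(1) histogram updates instead of a full re-scan.

     // 1-D sliding-window median over 8-bit values (Huang's algorithm); the
     // 2-D filter applies the same trick per scanline with column histograms.
     static byte[] MedianLine(byte[] src, int radius)
     {
         var dst = new byte[src.Length];
         var bins = new int[256];
         int Clamp(int i) => Math.Clamp(i, 0, src.Length - 1);

         // Prime the histogram with the first window (edges clamped).
         for (var i = -radius; i <= radius; i++) bins[src[Clamp(i)]]++;

         for (var x = 0; x < src.Length; x++)
         {
             // Walk the bins until we pass half the window: that's the median.
             var (half, tally, m) = (radius + 1, 0, 0);
             while (tally < half) tally += bins[m++];
             dst[x] = (byte)(m - 1);

             // FIFO step: drop the leftmost sample, add the next on the right.
             bins[src[Clamp(x - radius)]]--;
             bins[src[Clamp(x + radius + 1)]]++;
         }
         return dst;
     }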
  20. Totally fine. This is more of a proof of concept, and basically if I post any code, anyone can do anything with it. edit: If you optimize this, please educate me on how you did it. I've tested this version, but I still have both inputs set to simple, so I believe those 8 shaders get merged into 1 in the end. And the shader-function version runs 50%-ish faster than this one.

     protected override IDeviceImage OnCreateOutput(PaintDotNet.Direct2D1.IDeviceContext DC)
     {
         var radius  = (int)Token.GetProperty(PropertyNames.Radius ).Value;
         var percent = (int)Token.GetProperty(PropertyNames.Percent).Value;
         var sample  = (int)Token.GetProperty(PropertyNames.Sample ).Value;
         var delta   = new Vector2[] {new(1, 1), new(2, 1), new(2, 2)}[sample];
         var mapper  = D2D1TransformMapperFactory<Render>.Inflate(radius);
         var output  = (IDeviceImage)new FloodEffect(DC, new(0.5f));

         // Chain 8 passes; each pass halves the search step.
         for (var (ratio, i) = (0.5f, 0); i < 8; i++)
         {
             using var source = new BorderEffect(DC, Environment.SourceImage, BorderEdgeMode.Clamp);
             using var input = output;
             output = Shader([source, input], new(ratio *= 0.5f, radius, percent, delta), [], mapper);
         }
         return output;
     }

     [D2DInputCount(2), D2DInputSimple(0), D2DInputSimple(1), D2DInputDescription(0, D2D1Filter.MinMagMipPoint), AutoConstructor]
     private readonly partial struct Render : ID2D1PixelShader
     {
         private readonly float ratio, radius, percent;
         private readonly float2 delta;

         public float4 Execute()
         {
             float4 c = D2D.GetInput(1);
             float4 n = 0;
             float m = 0;
             float2 o = 0, p = 0, q = 0;
             q.Y = radius;
             p.Y = q.Y % delta.Y - q.Y;
             for (o.Y = p.Y; o.Y <= q.Y; o.Y += delta.Y)
             {
                 q.X = Hlsl.Trunc(Hlsl.Sqrt(q.Y * q.Y - o.Y * o.Y));
                 p.X = Hlsl.Abs(q.X - o.Y) % delta.X - q.X;
                 for (o.X = p.X; o.X <= q.X; o.X += delta.X)
                 {
                     float4 s = D2D.SampleInputAtOffset(0, o);
                     n += Hlsl.Step(s, c);
                     m += 1;
                 }
             }
             return c + ratio * (float4)Hlsl.Sign(Hlsl.Max(m * percent, 1) - n * 100 - 0.5f);
         }
     }
  21. I was assuming that shader.SetInput(0, new EmptyEffect(DC)) followed by D2D.GetInput(0) is exactly the same as having float4 EmptyEffect() => 0 in my shader and calling EmptyEffect(), after shader linking. Not only in the results, but in how they run. What you are suggesting is to intentionally use D2DInputComplex to prevent shader linking and split them up? edit: I only have a rough idea of how shader linking works, so my question is probably a bit off. At least I understand that two blur effects can't be linked.
  22. So a linked shader and a shader function work differently? Interesting. I'm new to this, so I was assuming everything gets inlined in the end. This is a binary search, and while I already had it in my toolbox, this is the first time I've used it this way, i.e. doing a lot of computing to decide which path to take. And yeah, I'm mostly a database guy actually, so maybe that affects how I explore ideas. haha Thanks! I was just going by "it's 'Color' if it's not 'Photo'", so I'll move it there.
  23. I'd been thinking about GPU median since I saw this thread, and I think I've found a usable solution. It's basically a Hi-Lo based algorithm, kinda like a B-tree and such.

     [D2DInputCount(1), D2DInputSimple(0), D2DInputDescription(0, D2D1Filter.MinMagMipPoint), AutoConstructor]
     private readonly partial struct Render : ID2D1PixelShader
     {
         private readonly float radius, percent;
         private readonly float2 delta;

         private float4 HiLo(float4 c)
         {
             // Count how many samples in the disc are at or below the pivot c.
             float4 n = 0;
             float m = 0;
             float2 o = 0, p = 0, q = 0;
             q.Y = radius;
             p.Y = q.Y % delta.Y - q.Y;
             for (o.Y = p.Y; o.Y <= q.Y; o.Y += delta.Y)
             {
                 q.X = Hlsl.Trunc(Hlsl.Sqrt(q.Y * q.Y - o.Y * o.Y));
                 p.X = Hlsl.Abs(q.X - o.Y) % delta.X - q.X;
                 for (o.X = p.X; o.X <= q.X; o.X += delta.X)
                 {
                     float4 s = D2D.SampleInputAtOffset(0, o);
                     n += Hlsl.Step(s, c);
                     m += 1;
                 }
             }
             // +1 when the target percentile lies above the pivot, -1 when below.
             return Hlsl.Sign(Hlsl.Max(m * percent, 1) - n * 100 - 0.5f);
         }

         public float4 Execute()
         {
             // 8 rounds of binary search narrow the result down to 1/256.
             float4 c = 0.5f;
             float d = 0.5f;
             c += HiLo(c) * (d *= 0.5f);
             c += HiLo(c) * (d *= 0.5f);
             c += HiLo(c) * (d *= 0.5f);
             c += HiLo(c) * (d *= 0.5f);
             c += HiLo(c) * (d *= 0.5f);
             c += HiLo(c) * (d *= 0.5f);
             c += HiLo(c) * (d *= 0.5f);
             c += HiLo(c) * (d *= 0.5f);
             return c;
         }
     }

     On my RTX 3060 Laptop, the performance of quarter-sampling mode matches the FIFO-optimized CPU median at radius = 100, but some artifacts are visible in places. On my IGPU, quarter-sampling mode at radius = 50 runs ...okay. At least it doesn't take forever. It's a bit difficult to judge which algorithm is usable, because each GPU is too different in performance. Full source code + dll: MedianFilterGPU.zip
  24. This is the difference, by the way. It's only visible when using a very low opacity layer and making the image very dark / bright.
  25. Umm. Maybe I should drop this from the published version, but I'm using it in some cases and will keep it for myself anyway, so I want to hear about it. Thanks in advance! Hopefully this is as easy as marking buffers "mapped" or something like that. haha