_koh_

Members
  • Posts

    77
  • Joined

  • Last visited


  1. This MS histogram is a bit cumbersome to use, so that's good to hear. In the latest version I'm mapping the input histogram to the output histogram like my CPU version does, and I need at least 4096 bins for that, but the MS histogram only supports up to 1024 bins, so I'm scanning the image 4 times and then back-calculating it. So if the new histogram supports a higher bin count, that's even better.

private Vector4[] Histogram(IDeviceImage image, int prec)
{
    using var idc = DC.CreateCompatibleDeviceContext(null, new(1024, 1024), DevicePixelFormats.Prgba128Float);
    using var odc = DC.CreateCompatibleDeviceContext(null, new(), DevicePixelFormats.Prgba128Float);

    var (bins, span) = (Math.Min(prec, 4) * 256, Math.Max(prec, 4) / 4);
    var data = new Vector4[span * bins];
    var d = new RectInt32(0, 0, Environment.Document.Size);
    var t = new RectInt32(0, 0, idc.PixelSize);

    // walk the document in tiles the size of idc
    for (t.Offset(0, -t.Y); t.Y < d.Height; t.Offset(0, t.Height))
    for (t.Offset(-t.X, 0); t.X < d.Width; t.Offset(t.Width, 0))
    {
        var r = RectInt32.Intersect(d, t);
        using (idc.UseBeginDraw())
            idc.DrawImage(image, null, r, compositeMode: CompositeMode.SourceCopy);

        foreach (var n in new[] { 0, 1, 2 })
        for (var j = 0; j < span; j++)
        {
            // per-pass scale/offset; each of the 'span' passes is shifted by 1/(span*bins)
            var (l, o) = (255f / 256, 0.5f / 256 - (j - span + 1f) / (span * bins));
            using var size = new CropEffect(DC, idc.Bitmap, new(0, 0, r.Size));
            using var tran = new ColorMatrixEffect(DC, size, new(l,0,0,0, 0,l,0,0, 0,0,l,0, 0,0,0,1, o,o,o,0));
            using var hist = new HistogramEffect(DC);
            hist.Properties.Input.Set(tran);
            hist.Properties.Bins.SetValue(bins);
            hist.Properties.ChannelSelect.SetValue((ChannelSelector)n);
            using (odc.UseBeginDraw())
                odc.DrawImage(hist);

            var (v, s) = (hist.Properties.HistogramOutput.GetValue(), (float)r.Area / d.Area);
            for (var i = 0; i < bins; i++)
                data[span * i + j][n] += v[i] * s;
        }
    }

    // back calculation: each entry minus the sum of the preceding span-1 entries
    var sum = (Span<Vector4> a) => { var x = Vector4.Zero; foreach (var v in a) x += v; return x; };
    for (var i = 0; i < data.Length; i++)
        data[i] -= sum(data.AsSpan(Math.Max(i - span + 1, 0)..i));

    return data;
}

Full source code: LightBalanceGPU.zip
  2. Posting the latest version with 2-bit binary search, or more like quarter search. Source code + dll: MedianFilterGPU.zip

Additionally, I made versions which compute 2, 3, 4 pixel colors at once, then ran them on 1/2, 1/3, 1/4 sized images to estimate compute shader performance.

8K image, radius 100, sampling rate 1/4, RTX 3060 laptop:
  • No optimization - 18.2s
  • INT8 sampling - 8.6s
  • 2-bit binary search - 10.2s
  • pseudo 2, 3, 4 pixel output - 10.2s, 8.2s, 7.8s
  • INT8 sampling + 2-bit binary search - 7.2s
  • INT8 sampling + pseudo 2, 3, 4 pixel output - 6.9s

Looks like about 2.6x the original version (18.2s / 6.9s) is the performance ceiling on my GPU, and this latest version is at about 2.5x (18.2s / 7.2s). The 2-bit binary search has to test 3 thresholds inside the loop to halve the iteration count; maybe that's why it runs slightly slower.

2-bit binary search shader:

private readonly partial struct Render : ID2D1PixelShader
{
    private readonly float r, p;
    private readonly float3 d;

    private float4 HiLo(float4 c, float v)
    {
        float3x4 n = 0;
        float m = 0;
        float y = r % d.Y - r;
        for (; y <= r; y += d.Y)
        {
            float w = Hlsl.Trunc(Hlsl.Sqrt(r * r - y * y));
            float x = (w + r * d.X + y / d.Y * d.Z) % d.X - w;
            for (; x <= w; x += d.X)
            {
                float4 s = D2D.SampleInputAtOffset(0, new(x, y));
                // tally each sample against three thresholds (c - v, c, c + v) at once
                n += Hlsl.Step(new float3x4(s, s, s), new(c - v, c, c + v));
                m += 1;
            }
        }
        return (float3)1 * (1 - 2 * Hlsl.Step(Hlsl.Max(m * p, 1), n * 100));
    }

    public float4 Execute()
    {
        float4 c = 0.5f;
        float v = 0.5f;
        c += HiLo(c, v *= 0.5f) * (v *= 0.5f);
        c += HiLo(c, v *= 0.5f) * (v *= 0.5f);
        c += HiLo(c, v *= 0.5f) * (v *= 0.5f);
        c += HiLo(c, v *= 0.5f) * (v *= 0.5f);
        return c;
    }
}

pseudo 4-pixel output shader:

private readonly partial struct Render : ID2D1PixelShader
{
    private readonly float r, p;
    private readonly float3 d;

    private float4x4 HiLo(float4x4 c, float2 o)
    {
        float4x4 n = 0;
        float m = 0;
        float y = r % d.Y - r;
        for (; y <= r; y += d.Y)
        {
            float w = Hlsl.Trunc(Hlsl.Sqrt(r * r - y * y));
            float x = (w + r * d.X + y / d.Y * d.Z) % d.X - w;
            for (; x - d.X * 3 <= w; x += d.X)
            {
                // float4 s = input[(int2)(o + new float2(x, y))];
                float4 s = D2D.SampleInputAtPosition(0, o + new float2(x, y));
                float4 a = Hlsl.Step(Hlsl.Abs(x - d.X * new float4(0, 1, 2, 3)), w);
                n += new float4x4(a.X, a.Y, a.Z, a.W) * Hlsl.Step(new(s, s, s, s), c);
                m += a.X;
            }
        }
        return 1 - 2 * Hlsl.Step(Hlsl.Max(m * p, 1), n * 100);
    }

    public float4 Execute()
    {
        // float2 o = new(ThreadIds.X * 4 - ThreadIds.X % d.X * 3, ThreadIds.Y);
        float2 o = D2D.GetScenePosition().XY;
        float4x4 c = 0.5f;
        float v = 0.5f;
        c += HiLo(c, o) * (v *= 0.5f);
        c += HiLo(c, o) * (v *= 0.5f);
        c += HiLo(c, o) * (v *= 0.5f);
        c += HiLo(c, o) * (v *= 0.5f);
        c += HiLo(c, o) * (v *= 0.5f);
        c += HiLo(c, o) * (v *= 0.5f);
        c += HiLo(c, o) * (v *= 0.5f);
        c += HiLo(c, o) * (v *= 0.5f);
        // output[(int2)(o + new float2(d.X * 0, 0))] = c[0];
        // output[(int2)(o + new float2(d.X * 1, 0))] = c[1];
        // output[(int2)(o + new float2(d.X * 2, 0))] = c[2];
        // output[(int2)(o + new float2(d.X * 3, 0))] = c[3];
        return (float4)1 / 4 * c;
    }
}
  3. My last post sounds like a mess even for my English, haha. I was trying to say: it looks like the original shader is idling 75% of the time because of bandwidth or latency, so doing 4x the computing per sample and cutting the loop iterations to 1/4 might be the sweet spot. I expect making any part of the for(y) for(x) for(i) loop 1/4 to have the same effect, but a 1/4 i loop requires 8x the computing per sample, and a pixel shader can't configure the xy loop. I tested the 1/2 i loop configuration and got a 15% boost with it. My latest version is already doing INT8 sampling, so apparently there isn't that much room left.

edit: I found the performance ceiling on my GPU is closer to 3x than 4x, so it's likely GPU clock dependent. When a 2GHz GPU is idling 75% of the time, a 1.5GHz GPU is idling about 66% of the time and so on (the 25% compute share just scales by 2.0/1.5 to about 33% of the same memory-bound wall time).
  4. Seems like, at the least, you can have 4x compute / fetch compared to the original shader for free. What if you change to arity = 2 and keep the 4-pixel output? That's another 4x compute / fetch setup, I believe.
  5. That's neat. I never thought about pre-computing a better pivot and starting with it. I haven't gone through the code so I might be wrong, but basically when the min-max range of the samples is < 1/2 we only need 7 tests instead of 8, when it's < 1/4 we only need 6, and so on. And we don't need the exact min-max to narrow down the range, so you are doing square sampling instead of circle sampling to get the V->H optimization. Something like that, I guess. (There's a rough step-count sketch after this list.)

Yeah, we can make sampling 1/n with 2^(n-1) registers. I considered it but never tested it. 2x registers means the GPU can keep only 1/2 the threads in flight, so 1/2, 1/3, 1/4 sampling comes with 1/2, 1/4, 1/8 the threads. Going beyond n = 2 is unlikely to be worth it, but it's very possible n = 2 is better than n = 1 in general.

I'm just curious, but is making it multi-pass better than using more tiles? PDN is already doing tile rendering.

Yeah, while linear color space gives us more accurate results in general, doing histogram, dithering, etc. in the original color space makes more sense.
  6. This is basically a histogram without an array. Instead of having bins[256] and computing a tally to test which bin is the median, it computes the tally every time and uses binary search to decide which bin is the median (there's a small CPU-side sketch of the idea after this list). And a histogram is one thing we should do in the storage format, which could be sRGB or anything. Maybe this is a better explanation.
  7. Ah, maybe that's good enough. I can already choose from FP32/FP16/INT8, and FP32 is likely a bit overkill for the nature of this effect. Correct. This shader can only pick one value from the predetermined values, so when we run it in the storage format's color space, we get the best result with the fewest steps. Actually, if we used linear value thresholds we could also finish in 8 steps in linear color space, but we would need to convert color spaces 8 times inside the shader, so converting the buffer to the storage format beforehand is the better way to do this.
  8. When you said you were calling HiLo() 12 times to avoid banding, I thought that was weird because you get the same results anyway, but now it makes more sense. In that case, I want to match the intermediate buffer precision to the original buffer precision, so I hope there will be a way to do that. Or can I do that already?
  9. Thanks for this. Seems like I should generally avoid manually creating a buffer unless rendering it is taxing and I can reuse it in the following tokens. In this case, it at least makes changing sliders and such more performant.

Yeah, 32KB of L1 can hold 128x128 INT8 pixels, so when the radius is relatively low, like < 32, the GPU likely reads from VRAM once and from L1 the rest of the time.

This particular shader only takes straight-alpha sRGB and outputs straight-alpha sRGB, so I get 100% pixel-matching results regardless of whether I use FP32 or INT8. Now I'm applying a blur filter to the output, and I wanted to do that in premultiplied linear, so I'm bringing it back to FP32 before that.

using var input = new PrecisionEffect(DC, source, BufferPrecision.UInt8Normalized);
using var expand = new BorderEffect(DC, input, BorderEdgeMode.Clamp);
using var render = Shader([expand], new(radius, percent, delta), [], D2D1TransformMapperFactory<Render>.Inflate(radius));
using var shrink = new CropEffect(DC, render, new(0, 0, Environment.Document.Size));
using var output = new PrecisionEffect(DC, shrink, BufferPrecision.Float32);
  10. Now I'm using PrecisionEffect instead of CDC.Bitmap, which gives me the same result at the same performance, but I don't know what it's actually doing. Does it create an intermediate buffer?
  11. Added smoothing. It applies a blur kernel the size of the sampling pattern to reduce the artifact. MedianFilterGPU.zip

When using 1/2, 1/4, 1/8 sampling, we are already seeing the average of 2, 4, 8 median colors, so manually averaging them does no harm, and it worked surprisingly well. I think quality = 2 is now good enough for many cases. Technically this is a mean of medians, and I can't explain why it looks closer to the true median than a median of means or a median of medians does. We are processing images, so the data has its own bias, I guess. (There's a toy mean-of-medians sketch after this list.)

Screenshots: quality = 1, smoothing on/off, 200% zoom.
  12. Thanks! Now I have lots of convenient built-in features, which is nice, but testing them to find out whether they do what I want is a bit time consuming. Interesting. I know a bit of the basics through WebGPU, and I've been wondering which is better: doing the thread-per-tile thing and optimizing within it, or keeping it as parallel as possible and just brute-forcing. Looks like you get some gain if you do it properly.
  13. The optimized filter is O(n) and this one is O(n^2), but now I'm halving the sampling rate whenever the radius doubles, so it's kind of O(n) in performance (there's a tiny cost-model sketch after this list).
  14. Made the sampling quality adaptive. When radius < 8: quality +4, < 16: +3, < 32: +2, < 64: +1, so the quality slider still has meaning (a minimal sketch of this rule is after this list). With this, quality = 2 on my iGPU is now pretty tolerable in both quality and performance.

edit: Maybe this is more readable. MedianFilterGPU.zip
  15. What will happen if I DrawImage() straight-alpha data to Pbgra32? I'm doing this.

edit: The result 100% pixel-matches the reference CPU version, so it's likely still straight. Um.
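
Re item 5: a back-of-the-envelope sketch of the step-count saving (my own reading, not code from the plugin), assuming an 8-bit target precision and that "range" means the pre-computed max - min spread of the samples.

// C# sketch; range is assumed to be in (0, 1]
static int StepsNeeded(float range)
{
    // range <= 1/2 -> 7 steps, <= 1/4 -> 6 steps, and so on
    int steps = 8 + (int)Math.Ceiling(Math.Log2(range));
    return Math.Clamp(steps, 0, 8);
}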
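
Re item 6: a minimal CPU-side sketch of the "histogram without an array" idea, not the plugin's shader code. Instead of filling bins[256] once, it re-counts how many samples fall below the current pivot at every bisection step.

static float BinarySearchMedian(ReadOnlySpan<float> samples)
{
    float c = 0.5f, v = 0.25f;            // pivot and half-step, like the shader's c/v pair
    for (int step = 0; step < 8; step++)  // 8 steps resolve roughly 8-bit precision
    {
        int below = 0;
        foreach (float s in samples)
            if (s < c) below++;
        // move the pivot toward the half that still contains the median
        c += below * 2 > samples.Length ? -v : v;
        v *= 0.5f;
    }
    return c;
}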
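
Re item 11: a toy illustration (mine, not the plugin's code) of the mean-of-medians effect of the smoothing pass: with 1/4 sampling, four interleaved sample sets each produce their own median, and averaging those estimates is what the sampling-pattern-sized blur approximates.

// needs System.Linq and System.Collections.Generic
static float MeanOfMedians(float[] samples, int stride) // stride = 4 for 1/4 sampling
{
    var medians = new List<float>();
    for (int offset = 0; offset < stride; offset++)
    {
        // every stride-th sample, starting at a different offset
        var subset = samples.Where((_, i) => i % stride == offset).OrderBy(x => x).ToArray();
        medians.Add(subset[subset.Length / 2]);
    }
    return medians.Average(); // mean of the per-subset medians
}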
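
Re item 13: the rough cost model I'm assuming behind "kind of O(n)": samples per pixel ≈ disc area × sampling rate, so halving the rate each time the radius doubles keeps the per-pixel cost growing only linearly with the radius.

static double SamplesPerPixel(double radius, double rate) => Math.PI * radius * radius * rate;
// SamplesPerPixel( 50, 1.0 / 2) ≈  3927
// SamplesPerPixel(100, 1.0 / 4) ≈  7854   (radius x2 -> samples x2)
// SamplesPerPixel(200, 1.0 / 8) ≈ 15708   (radius x4 -> samples x4)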
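
Re item 14: a minimal sketch of the adaptive-quality rule as I read it; the thresholds are from the post, the helper name is mine.

static int EffectiveQuality(int quality, int radius) => quality + radius switch
{
    < 8  => 4,   // radius < 8: quality +4
    < 16 => 3,
    < 32 => 2,
    < 64 => 1,
    _    => 0
};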