GPU Median Filter

_koh_ · February 15

2 hours ago, Rick Brewster said:

It's very beneficial to use PrecisionEffect instead of a CompatibleDeviceContext.Bitmap because 1) that lets Direct2D manage the rendering process and memory management, and 2) it permits Paint.NET to manage rendering with tiles along with progress reporting and cancellation support. Otherwise you're forcing everything to render during OnCreateOutput(), during which there is no progress reporting or cancellation support.

Thanks for this.

Seems like I should generally avoid manually creating a buffer unless rendering it is taxing and I can reuse it in the following tokens. In this case, at least it makes changing sliders and such more performant.

2 hours ago, Rick Brewster said:

It likely doesn't reduce VRAM bandwidth because the GPU would be using an internal cache (e.g. L2) anyway, but it will reduce the bandwidth pressure on that internal cache.

Yeah, a 32KB L1 can hold 128x128 INT8 pixels, so when the radius is relatively low (say < 32), the GPU is likely reading from VRAM once and from L1 the rest of the time.

2 hours ago, Rick Brewster said:

By using PrecisionEffect you are manually reducing the precision, which as you've seen can improve performance. However, it will of course reduce precision and color accuracy.

This particular shader only takes straight alpha sRGB and outputs straight alpha sRGB, so I get pixel-for-pixel identical results regardless of whether I use FP32 or INT8.

Now I'm applying a blur filter to the output and I wanted to do it in pre-multiplied linear, so I'm bringing it back to FP32 before that.

using var input  = new PrecisionEffect(DC, source, BufferPrecision.UInt8Normalized);
using var expand = new BorderEffect(DC, input, BorderEdgeMode.Clamp);
using var render = Shader([expand], new(radius, percent, delta), [], D2D1TransformMapperFactory<Render>.Inflate(radius));
using var shrink = new CropEffect(DC, render, new(0, 0, Environment.Document.Size));
using var output = new PrecisionEffect(DC, shrink, BufferPrecision.Float32);

Rick Brewster · February 15

40 minutes ago, _koh_ said:

This particular shader only takes straight alpha sRGB and outputs straight alpha sRGB

btw this won't necessarily be true in future releases of Paint.NET

First, the upcoming v5.1 will have color management -- so an effect will either receive pixels in the image's "working space" (which is currently de facto sRGB) which is still the unmodified BGRA32 values, or the pixels will be converted to the linearized version of the image's actual color profile (WorkingSpaceLinear is the default). As a backup, in case the color profile can't be linearized, the image will be converted to scRGB (linear sRGB). The effect's output will then be automatically converted back to the storage format of the image.

Second, for future releases I am planning on adding higher-precision pixel formats like RGBA64 (4 x uint16), RGBA64Half (4 x float16), and even RGBA128Float (4 x float32).

In other words, I would not rely on PrecisionEffect(UInt8Normalized) as a way to maintain the original precision -- because that won't be true in the future. I designed the new effect systems with future-proofing in mind!

_koh_ · February 15

34 minutes ago, Rick Brewster said:

In other words, I would not rely on PrecisionEffect(UInt8Normalized) as a way to maintain the original precision -- because that won't be true in the future. I designed the new effect systems with future-proofing in mind!

When you said you are calling HiLo() 12 times to avoid banding, I thought that was odd because you'd get the same results anyway, but now it makes more sense.

In that case, I want to match intermediate buffer precision to the original buffer precision so I hope there will be a way to do that. Or can I do that already?

Rick Brewster · February 15

Just now, _koh_ said:

I want to match intermediate buffer precision to the original buffer precision

This information isn't available in the plugin interfaces. I would simply add an option in the UI for low vs. full precision.

2 minutes ago, _koh_ said:

When you said you are calling HiLo() 12 times to avoid banding

From what I could understand from the algorithm, each call to HiLo() essentially calculates 1 bit of precision starting from the most-significant bit. When working with linearized pixels (that is, WorkingSpaceLinear instead of WorkingSpace), you need up to 12 bits because the values are spread out differently.

The increase from 8 to 12 is pretty dramatic with some images, but I could only see very minute differences after that. Even 11 to 12 was very small, but still noticeable upon close inspection.

Going forward it may be necessary to run up to 16 times, but it's easy to make it configurable for when that comes up.
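
One way to see where a figure like 12 bits can come from (a back-of-the-envelope check, not necessarily the reasoning used inside Paint.NET): after linearization, the darkest 8-bit sRGB codes sit much closer together than 1/255, so a search over [0,1] needs more halvings to tell them apart. The sketch below assumes the standard sRGB transfer function; the class and method names are made up for illustration.

```csharp
using System;

public static class LinearPrecisionSketch
{
    // Standard sRGB decoding: encoded value in [0,1] -> linear light in [0,1].
    public static double SrgbToLinear(double c) =>
        c <= 0.04045 ? c / 12.92 : Math.Pow((c + 0.055) / 1.055, 2.4);

    // Smallest distance between two adjacent 8-bit sRGB codes after linearization.
    public static double SmallestLinearGap()
    {
        double smallest = double.MaxValue;
        for (int code = 0; code < 255; ++code)
        {
            double gap = SrgbToLinear((code + 1) / 255.0) - SrgbToLinear(code / 255.0);
            smallest = Math.Min(smallest, gap);
        }
        return smallest;
    }

    // Number of halvings of [0,1] needed before the interval is no wider than `gap`.
    public static int BitsToResolve(double gap) =>
        (int)Math.Ceiling(Math.Log2(1.0 / gap));
}
```

With these assumptions, `BitsToResolve(1.0 / 255.0)` gives 8 for values kept in sRGB, while `BitsToResolve(SmallestLinearGap())` gives 12 for the linearized values, which lines up with the iteration counts discussed above.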

_koh_ · February 16

12 minutes ago, Rick Brewster said:

This information isn't available in the plugin interfaces. I would simply add an option in the UI for low vs. full precision.

Ah maybe that's good enough. I can already choose from FP32/FP16/INT8 and likely FP32 is a bit overkill for the nature of this effect.

16 minutes ago, Rick Brewster said:

From what I could understand from the algorithm, each call to HiLo() essentially calculates 1 bit of precision starting from the most-significant bit.

Correct. This shader can only pick one value from a set of predetermined values, so when we run it in the storage format's color space, we get the best result in the fewest steps.

Actually, if we use linear-value thresholds, we can finish in 8 steps in linear color space, but we'd need to convert color space 8 times in the shader, so converting the buffer to the storage format beforehand is the better way to do this.

_koh_ · February 17

This is basically a histogram without an array.
Instead of keeping bins[256] and accumulating a tally to test which bin holds the median, it recalculates the tally every time and uses binary search to decide which bin holds the median.
And a histogram is one thing we should compute in the storage format, which could be sRGB or anything else.
Maybe this is a better explanation.
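
A minimal CPU-side sketch of that idea, assuming 8-bit samples and a 1-based rank (rank = (n + 1) / 2 for the median of an odd-sized neighborhood). The names `CountingMedianSketch`, `CountAtOrBelow`, and `SelectByBisection` are hypothetical and are not part of the shader or of Paint.NET:

```csharp
using System;

public static class CountingMedianSketch
{
    // Counts how many samples are <= threshold. This is the running tally a
    // bins[256] histogram would provide, recomputed from scratch on every call.
    private static int CountAtOrBelow(ReadOnlySpan<byte> samples, int threshold)
    {
        int count = 0;
        foreach (byte s in samples)
        {
            if (s <= threshold) ++count;
        }
        return count;
    }

    // Returns the smallest value v such that at least `rank` samples are <= v.
    // Assumes 1 <= rank <= samples.Length. Eight bisection steps cover all 256 values.
    public static byte SelectByBisection(ReadOnlySpan<byte> samples, int rank)
    {
        int lo = 0;   // the answer is >= lo
        int hi = 255; // the answer is <= hi
        for (int step = 0; step < 8; ++step)
        {
            int mid = (lo + hi) / 2;
            if (CountAtOrBelow(samples, mid) >= rank)
            {
                hi = mid;     // enough samples at or below mid: answer is in [lo, mid]
            }
            else
            {
                lo = mid + 1; // too few samples: answer is in [mid + 1, hi]
            }
        }
        return (byte)lo;
    }
}
```

Each step rereads every sample, which is the trade the shader makes: no per-pixel array, at the cost of one full pass over the neighborhood per bit of output.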

Rick Brewster · February 25

I've been able to optimize this further vs. the original shader (at full sampling) in this post, cutting down the execution time by about 42% -- without using a compute shader (although that's my next step!) and while also improving quality.

Here's how I did it:

Instead of starting at a pivot point of c=0.5, I use the output of shaders that calculate the min and max for the neighborhood square kernel area. This then establishes the traditional lo, hi, and pivot values for the binary search. This is less precise than taking the min/max of the circle kernel area, but can execute substantially faster (linear instead of quadratic) because these are separable kernels. This has two side effects: 1) it increases precision in areas that have a smaller dynamic range, and 2) it supports any dynamic range of values, not just those that are [0,1].
Binary search provides 1 bit per iteration. I also implemented 4-ary, 8-ary, and 16-ary. I kept only 4-ary enabled because it offers the best overall performance and can reach 8 bits of output in 4 iterations (instead of 8 iterations w/ binary search). The 8-ary can hit 9 bits in 3 iterations, which is more than we need. The 16-ary can hit 8 bits in 2 iterations, but because it uses so many registers it actually runs slower due to reduced shader occupancy.
The search now produces the wrong result when percentile=0 because it can only output the value from the localized min shader, which often provides the min value of a pixel outside the circular kernel. This means you get "squares" instead of "circles" in the output. I special-case this to use a different shader that finds the minimum value within the circular kernel. It's possible to incorporate this logic into the regular n-ary shader methods, but it significantly reduces performance.
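
Two of the claims in the list above are easy to check on the CPU. The sketch below is only an illustration under stated assumptions: it is not the MinHorizontalShader / MinVerticalShader code (which isn't shown here), it assumes clamped edges, and all names in it are hypothetical. It shows that a horizontal 1D min pass followed by a vertical one yields the same result as the brute-force square minimum while reading O(r) instead of O(r^2) samples per pixel, and it counts the bits an n-ary search resolves.

```csharp
using System;
using System.Numerics;

public static class SquareBoundsSketch
{
    // Reference version: minimum over the (2r+1) x (2r+1) square, O(r^2) reads per pixel.
    public static float[,] MinSquareBruteForce(float[,] src, int r)
    {
        int h = src.GetLength(0), w = src.GetLength(1);
        var dst = new float[h, w];
        for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
        {
            float m = float.MaxValue;
            for (int dy = -r; dy <= r; ++dy)
            for (int dx = -r; dx <= r; ++dx)
            {
                m = Math.Min(m, src[Math.Clamp(y + dy, 0, h - 1), Math.Clamp(x + dx, 0, w - 1)]);
            }
            dst[y, x] = m;
        }
        return dst;
    }

    // One 1D pass (horizontal or vertical) with clamped edges, O(r) reads per pixel.
    private static float[,] MinPass(float[,] src, int r, bool horizontal)
    {
        int h = src.GetLength(0), w = src.GetLength(1);
        var dst = new float[h, w];
        for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
        {
            float m = float.MaxValue;
            for (int d = -r; d <= r; ++d)
            {
                int sy = horizontal ? y : Math.Clamp(y + d, 0, h - 1);
                int sx = horizontal ? Math.Clamp(x + d, 0, w - 1) : x;
                m = Math.Min(m, src[sy, sx]);
            }
            dst[y, x] = m;
        }
        return dst;
    }

    // Horizontal pass followed by a vertical pass: same values as the brute-force square.
    public static float[,] MinSquareSeparable(float[,] src, int r) =>
        MinPass(MinPass(src, r, horizontal: true), r, horizontal: false);

    // Bits resolved by an n-ary search (arity a power of two): log2(arity) per iteration.
    public static int BitsResolved(int arity, int iterations) =>
        iterations * BitOperations.Log2((uint)arity);
}
```

`BitsResolved(4, 4)` and `BitsResolved(16, 2)` both give 8, and `BitsResolved(8, 3)` gives 9, matching the counts quoted above. The max bound works the same way with `Math.Max`.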

For my performance testing, I used a 12K x 8K image. I set radius to 100, percentile to 75, and then used either "Full" sampling (w/ your original shader), or the default iteration count (for my shaders). Your original shader took 30.7 seconds, while my 4-ary implementation takes 17.8 seconds (with higher quality!).

The next steps for optimization would seem to be using a compute shader, which could calculate multiple output pixels at once. This should be able to bring that 17.8 down even further, meaning this might even be shippable as a built-in PDN effect! And a quality slider that chooses full vs. half vs. etc. sampling would also enable faster performance (like your shader does).

I'd also like to separate each iteration of the algorithm into its own rendering pass. This would definitely require a compute shader, as it would need to write out 2 additional float4s in order to provide the hi/lo markers (so the output image would be 3x the width of the input image, and then a final shader would discard those 2 extra values). This would enable the effect to run without monopolizing the GPU as much and would help to avoid causing major UI lag. I don't think it would improve performance, but I need to see how it goes.

Here's the code I've got so far. It's using some PDN internal stuff (like PixelShaderEffect<T>), but you can still translate it to not use the internal stuff.


//#define SEARCH_ARITY_2
#define SEARCH_ARITY_4
//#define SEARCH_ARITY_8
//#define SEARCH_ARITY_16

using ComputeSharp;
using ComputeSharp.D2D1;
using PaintDotNet.Collections;
using PaintDotNet.ComponentModel;
using PaintDotNet.Direct2D1;
using PaintDotNet.Direct2D1.Effects;
using PaintDotNet.Rendering;
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

namespace PaintDotNet.Effects.Gpu;

// TODO: Describe algorithm
// TODO: doc comments
public sealed partial class PdnMedianEffect
    : CustomEffect<PdnMedianEffect.Props>
{
    internal const int MaxRadius = 100;

    // Allow slider to go up to "12 bits" worth of precision. We actually get more effective bits due to
    // other particulars of the algorithm, but this is a good stopping point that allows them all to line up.
    // 2-ary search generates 1 bit per iteration, 4-ary search generates 2 bits, 8-ary search generates
    // 3 bits, and 16-ary search generates 4-bits.
    internal const int MaxIterations =
#if SEARCH_ARITY_2
        12
#elif SEARCH_ARITY_4
        6
#elif SEARCH_ARITY_8
        4
#elif SEARCH_ARITY_16
        3
#endif
        ;

    // The default iteration count targets 8 bits of output (9 bits for arity 8).
    // Round up so that integer division doesn't truncate the 8-ary default from 3 to 2.
    internal const int DefaultIterations = (MaxIterations * 2 + 2) / 3; // 8, 4, 3, 2

    public PdnMedianEffect(IDeviceEffectFactory factory)
        : base(factory)
    {
    }

    public sealed class Props
        : CustomEffectProperties
    {
        protected override CustomEffectImpl CreateImpl()
        {
            return new Impl();
        }

        public EffectInputAccessor Input => CreateInputAccessor(0);

        /// <summary>
        /// The radius of the effect. A value of 0 will disable the effect.<br/>
        /// Performance scales with the square of this property value. Doubling the radius will quadruple the rendering time.<br/>
        /// The valid range is [0, 100], the default is 25.
        /// </summary>
        public EffectPropertyAccessor<float> Radius => CreateFloatPropertyAccessor(0, radiusSpec);
        private static readonly EffectPropertyValueSpec radiusSpec = new EffectPropertyValueSpec(25.0f, 0.0f, PdnMedianEffect.MaxRadius);

        /// <summary>
        /// Specifies the percentile to use when approximating the median. Lower values result in darkened/eroded results,
        /// while higher values result in brightened/dilated results.<br/>
        /// The range is [0,1] which corresponds to the UI's range of [0, 100]. The default value is 0.5f.
        /// </summary>
        public EffectPropertyAccessor<float> Percentile => CreateFloatPropertyAccessor(1, percentileSpec);
        private static readonly EffectPropertyValueSpec percentileSpec = new EffectPropertyValueSpec(0.5f, 0.0f, 1.0f);

        // TODO: doc comment, value spec
        public EffectPropertyAccessor<int> Iterations => CreateInt32PropertyAccessor(2); // [1, MaxIterations]

        /// <summary>
        /// Specifies how sampling beyond the edge of the image should be performed.<br/>
        /// The default value is <see cref="BorderEdgeMode2.Clamp"/>.
        /// </summary>
        public EffectPropertyAccessor<BorderEdgeMode2> EdgeMode => CreateEnumPropertyAccessor<BorderEdgeMode2>(3, edgeModeSpec);
        private static readonly EffectPropertyValueSpec edgeModeSpec = new EffectPropertyValueSpec(BorderEdgeMode2.Clamp, null, null);

        /// <summary>
        /// Specifies the alpha mode for the input and output.<br/>
        /// The default value is <see cref="AlphaMode.Premultiplied"/>.
        /// </summary>
        public EffectPropertyAccessor<AlphaMode> AlphaMode => CreateEnumPropertyAccessor<AlphaMode>(4, alphaModeSpec);
        private static readonly EffectPropertyValueSpec alphaModeSpec = new EffectPropertyValueSpec(PaintDotNet.Direct2D1.AlphaMode.Premultiplied, null, null);
    }

    internal sealed partial class Impl
        : CustomEffectImpl<Props>
    {
        private EffectTransform<ConvertAlphaEffect>? convertInputAlpha;
        private EffectTransform<BorderEffect2>? border;
        private EffectTransform<PixelShaderEffect<MinHorizontalShader>>? minValueH;
        private EffectTransform<PixelShaderEffect<MinVerticalShader>>? minValueV;
        private EffectTransform<PixelShaderEffect<MaxHorizontalShader>>? maxValueH;
        private EffectTransform<PixelShaderEffect<MaxVerticalShader>>? maxValueV;
        private EffectTransform<PixelShaderEffect<HiLoShader>>? hiLoShader;
        private EffectTransform<PixelShaderEffect<HiLoP0orP1Shader>>? hiLoP0orP1Shader;
        private EffectTransform<ConvertAlphaEffect>? convertOutputAlpha;

        public Impl()
        {
        }

        protected override void Dispose(bool disposing)
        {
            DisposableUtil.Free(ref this.convertInputAlpha);
            DisposableUtil.Free(ref this.border);
            DisposableUtil.Free(ref this.minValueH);
            DisposableUtil.Free(ref this.minValueV);
            DisposableUtil.Free(ref this.maxValueH);
            DisposableUtil.Free(ref this.maxValueV);
            DisposableUtil.Free(ref this.hiLoShader);
            DisposableUtil.Free(ref this.hiLoP0orP1Shader);
            DisposableUtil.Free(ref this.convertOutputAlpha);
            base.Dispose(disposing);
        }

        protected override void OnInitialize()
        {
            this.Properties.Radius.SetValue(25); // matches the documented default and radiusSpec
            this.Properties.Percentile.SetValue(0.5f);
            this.Properties.Iterations.SetValue(DefaultIterations);
            this.Properties.EdgeMode.SetValue(BorderEdgeMode2.Clamp);
            this.Properties.AlphaMode.SetValue(AlphaMode.Premultiplied);

            this.convertInputAlpha = this.TransformGraph.AddNode(new ConvertAlphaEffect(this.EffectContext));
            this.border = this.TransformGraph.AddNode(new BorderEffect2(this.EffectContext));
            this.minValueH = this.TransformGraph.AddNode(new PixelShaderEffect<MinHorizontalShader>(this.EffectContext));
            this.minValueV = this.TransformGraph.AddNode(new PixelShaderEffect<MinVerticalShader>(this.EffectContext));
            this.maxValueH = this.TransformGraph.AddNode(new PixelShaderEffect<MaxHorizontalShader>(this.EffectContext));
            this.maxValueV = this.TransformGraph.AddNode(new PixelShaderEffect<MaxVerticalShader>(this.EffectContext));
            this.hiLoShader = this.TransformGraph.AddNode(new PixelShaderEffect<HiLoShader>(this.EffectContext));
            this.hiLoP0orP1Shader = this.TransformGraph.AddNode(new PixelShaderEffect<HiLoP0orP1Shader>(this.EffectContext));
            this.convertOutputAlpha = this.TransformGraph.AddNode(new ConvertAlphaEffect(this.EffectContext));

            this.TransformGraph.ConnectToEffectInput(0, this.convertInputAlpha, 0);
            this.TransformGraph.ConnectNode(this.convertInputAlpha, this.border, 0);
            this.TransformGraph.ConnectNode(this.border, this.minValueH, 0);
            this.TransformGraph.ConnectNode(this.border, this.maxValueH, 0);
            this.TransformGraph.ConnectNode(this.minValueH, this.minValueV, 0);
            this.TransformGraph.ConnectNode(this.maxValueH, this.maxValueV, 0);
            this.TransformGraph.ConnectNode(this.border, this.hiLoShader, 0);
            this.TransformGraph.ConnectNode(this.minValueV, this.hiLoShader, 1);
            this.TransformGraph.ConnectNode(this.maxValueV, this.hiLoShader, 2);
            this.TransformGraph.ConnectNode(this.border, this.hiLoP0orP1Shader, 0);

            base.OnInitialize();
        }

        protected override void OnPrepareForRender(ChangeType changeType)
        {
            float radius = Math.Clamp(this.Properties.Radius.GetValue(), 0, MaxRadius);
            float percentile = Math.Clamp(this.Properties.Percentile.GetValue(), 0.0f, 1.0f);
            int iterations = Math.Clamp(this.Properties.Iterations.GetValue(), 1, MaxIterations);
            BorderEdgeMode2 edgeMode = this.Properties.EdgeMode.GetValue();
            AlphaMode alphaMode = this.Properties.AlphaMode.GetValue();

            if (radius <= 0)
            {
                this.TransformGraph.SetPassthroughGraph(0);
            }
            else
            {
                this.convertInputAlpha!.Effect.Properties.Mode.SetValue(
                    alphaMode == AlphaMode.Straight ? ConvertAlphaMode.Passthrough : ConvertAlphaMode.UnPremultiply);

                this.border!.Effect.Properties.EdgeMode.SetValue((BorderEdgeMode2)edgeMode);

                int radiusI = Math.Clamp((int)Math.Ceiling(radius), 1, MaxRadius);
                RectInt32 samplingRect = RectInt32.FromEdges(-radiusI, -radiusI, radiusI, radiusI);

                this.minValueH!.Effect.Properties.Constants.SetValue(new MinHorizontalShader(radiusI));
                this.minValueV!.Effect.Properties.Constants.SetValue(new MinVerticalShader(radiusI));
                this.maxValueH!.Effect.Properties.Constants.SetValue(new MaxHorizontalShader(radiusI));
                this.maxValueV!.Effect.Properties.Constants.SetValue(new MaxVerticalShader(radiusI));

                using PooledNativeList<Vector4Float> samplingOffsetsRle = PooledNativeList<Vector4Float>.Get();
                int samplingArea = 0;
                int cutoffPow2 = ((radiusI * 2 + 1) * (radiusI * 2 + 1) + 2) / 4; // Produces a nicer looking circle than just r^2. Approximately (r+0.5)^2
                for (int dy = -radiusI; dy <= +radiusI; ++dy)
                {
                    int dxBegin = int.MaxValue;
                    int dxLength = 0;
                    for (int dx = -radiusI; dx <= +radiusI; ++dx)
                    {
                        if ((dx * dx + dy * dy) <= cutoffPow2)
                        {
                            dxBegin = Math.Min(dxBegin, dx);
                            ++dxLength;
                            ++samplingArea;
                        }
                    }

                    Debug.Assert(dxBegin != int.MaxValue && dxLength > 0);
                    samplingOffsetsRle.Add(new Vector4Float(
                        Unsafe.BitCast<int, float>(dxBegin),
                        Unsafe.BitCast<int, float>(dy),
                        Unsafe.BitCast<int, float>(dxLength),
                        0));
                }

                using ExtentPtrHandle<Vector4Float> samplingOffsetsRleHandle = samplingOffsetsRle.AcquireExtent();

                using PixelShaderResourceTexture1D samplingOffsetsRleResTex = new PixelShaderResourceTexture1D(
                    samplingOffsetsRleHandle.Extent,
                    TextureFilter.MinMagMipPoint,
                    ExtendMode.Clamp);

                if (percentile <= 0.0f || percentile >= 1.0f)
                {
                    // When percentile is 0, we use this in order to calculate the correct value.
                    // Otherwise we get the min value from the square neighborhood courtesy of MinShader
                    // instead of the circle kernel established by the sampling offsets array.
                    // When percentile is 1, we use this shader because it's a lot faster (>4x).
                    this.hiLoP0orP1Shader!.Effect.Properties.ResourceTexture(1).SetValue(samplingOffsetsRleResTex);
                    this.hiLoP0orP1Shader!.Effect.Properties.Constants.SetValue(new HiLoP0orP1Shader(
                        percentile <= 0.0f,
                        Unsafe.BitCast<RectInt32, int4>(samplingRect)));

                    this.TransformGraph.ConnectNode(this.hiLoP0orP1Shader!, this.convertOutputAlpha!, 0);
                }
                else
                {
                    this.hiLoShader!.Effect.Properties.ResourceTexture(3).SetValue(samplingOffsetsRleResTex);
                    this.hiLoShader!.Effect.Properties.Constants.SetValue(new HiLoShader(
                        (float)((double)samplingArea * percentile),
                        (uint)iterations,
                        Unsafe.BitCast<RectInt32, int4>(samplingRect)));

                    this.TransformGraph.ConnectNode(this.hiLoShader!, this.convertOutputAlpha!, 0);
                }

                this.convertOutputAlpha!.Effect.Properties.Mode.SetValue(
                    alphaMode == AlphaMode.Straight ? ConvertAlphaMode.Passthrough : ConvertAlphaMode.Premultiply);
                
                this.TransformGraph.SetOutputNode(this.convertOutputAlpha!);
            }

            base.OnPrepareForRender(changeType);
        }

        // Input0 = source image
        // Input1 = minimum value for kernel area
        // Input2 = maximum value for kernel area
        [D2DInputCount(3)]
        [D2DInputComplex(0)]
        [D2DInputDescription(0, D2D1Filter.MinMagMipPoint)]
        [D2DInputSimple(1)]
        [D2DInputDescription(1, D2D1Filter.MinMagMipPoint)]
        [D2DInputSimple(2)]
        [D2DInputDescription(2, D2D1Filter.MinMagMipPoint)]
        [D2DGeneratedPixelShaderDescriptor]
        [AutoConstructor]
        internal readonly partial struct HiLoShader
            : IPixelShader<HiLoShader>
        {
            static IPixelShaderTransformImpl IPixelShader<HiLoShader>.CreateTransform(in HiLoShader shader)
            {
                if (shader.iterations > MaxIterations)
                {
                    throw new InternalErrorException($"iterations ({shader.iterations}) > {nameof(MaxIterations)} ({MaxIterations})");
                }

                return new HiLoShaderTransform(Unsafe.BitCast<int4, RectInt32>(shader.samplingRectXYWH));
            }

            private readonly float targetArea;
            private readonly uint iterations; // [1,MaxIterations]
            private readonly int4 samplingRectXYWH;

            // These are actually [dx, dy, len, 0] tuples of type int4
            // This ends up being a few percent faster than doing the dx,dy loop in the
            // shader and skipping pixels that are outside the radius cutoff.
            [AutoConstructorIgnore]
            [D2DResourceTextureIndex(3)]
            private readonly D2D1ResourceTexture1D<float4> samplingOffsetsRle;

            public float4 Execute()
            {
                HiLoState state;
                state.lo = D2D.GetInput(1);
                state.hi = D2D.GetInput(2);
                state.pivot = (state.lo + state.hi) / 2;

                HiLo(ref state);
#if SEARCH_ARITY_2 || SEARCH_ARITY_4 || SEARCH_ARITY_8 || SEARCH_ARITY_16
                if (this.iterations >= 2) HiLo(ref state);
                if (this.iterations >= 3) HiLo(ref state);
#if SEARCH_ARITY_2 || SEARCH_ARITY_4 || SEARCH_ARITY_8
                if (this.iterations >= 4) HiLo(ref state);
#if SEARCH_ARITY_2 || SEARCH_ARITY_4
                if (this.iterations >= 5) HiLo(ref state);
                if (this.iterations >= 6) HiLo(ref state);
#if SEARCH_ARITY_2
                if (this.iterations >= 7) HiLo(ref state);
                if (this.iterations >= 8) HiLo(ref state);
                if (this.iterations >= 9) HiLo(ref state);
                if (this.iterations >= 10) HiLo(ref state);
                if (this.iterations >= 11) HiLo(ref state);
                if (this.iterations >= 12) HiLo(ref state);
#endif
#endif
#endif
#endif

                return state.pivot;
            }

            private struct HiLoState
            {
                public float4 lo;
                public float4 pivot;
                public float4 hi;
            }

#if SEARCH_ARITY_2
            // Binary (2-ary) implementation
            private void HiLo(ref HiLoState state)
            {
                float4 m0 = state.lo;
                float4 m1 = state.pivot;
                float4 m2 = state.hi;

                float4 stepsM0 = 0;
                float4 stepsM1 = 0;

                int sorLength = this.samplingOffsetsRle.Width;
                for (int sori = 0; sori < sorLength; ++sori)
                {
                    int3 dxdyLen = Hlsl.AsInt(this.samplingOffsetsRle[sori].XYZ);
                    float dy = dxdyLen.Y;
                    int dxEnd = dxdyLen.X + dxdyLen.Z;

                    for (int dx = dxdyLen.X; dx < dxEnd; ++dx)
                    {
                        float2 offset = new float2(dx, dy);
                        float4 sample = D2D.SampleInputAtOffset(0, offset);

                        stepsM0 += Hlsl.Step(sample, m0);
                        stepsM1 += Hlsl.Step(sample, m1);
                    }
                }

                bool4 isM01 = this.targetArea <= stepsM1;


                state.lo = Hlsl.Select(isM01, m0, m1);
                state.hi = Hlsl.Select(isM01, m1, m2);
                state.pivot = (state.lo + state.hi) / 2;
            }
#elif SEARCH_ARITY_4
            // Quaternary (4-ary) implementation
            private void HiLo(ref HiLoState state)
            {
                float4 m0 = state.lo;
                float4 m1 = (state.lo + state.pivot) / 2;
                float4 m2 = state.pivot;
                float4 m3 = (state.pivot + state.hi) / 2;
                float4 m4 = state.hi;

                float4 stepsM0 = 0;
                float4 stepsM1 = 0;
                float4 stepsM2 = 0;
                float4 stepsM3 = 0;

                int sorLength = this.samplingOffsetsRle.Width;
                for (int sori = 0; sori < sorLength; ++sori)
                {
                    int3 dxdyLen = Hlsl.AsInt(this.samplingOffsetsRle[sori].XYZ);
                    float dy = dxdyLen.Y;
                    int dxEnd = dxdyLen.X + dxdyLen.Z;

                    for (int dx = dxdyLen.X; dx < dxEnd; ++dx)
                    {
                        float2 offset = new float2(dx, dy);
                        float4 sample = D2D.SampleInputAtOffset(0, offset);

                        stepsM0 += Hlsl.Step(sample, m0);
                        stepsM1 += Hlsl.Step(sample, m1);
                        stepsM2 += Hlsl.Step(sample, m2);
                        stepsM3 += Hlsl.Step(sample, m3);
                    }
                }

                bool4 isM01 = this.targetArea <= stepsM1;
                bool4 isM12 = this.targetArea <= stepsM2;
                bool4 isM23 = this.targetArea <= stepsM3;

                state.lo =
                    Hlsl.Select(isM01, m0,
                    Hlsl.Select(isM12, m1,
                    Hlsl.Select(isM23, m2,
                                       m3)));

                state.hi =
                    Hlsl.Select(isM01, m1,
                    Hlsl.Select(isM12, m2,
                    Hlsl.Select(isM23, m3,
                                       m4)));

                state.pivot = (state.lo + state.hi) / 2;
            }
#elif SEARCH_ARITY_8
            // 8-ary
            private void HiLo(ref HiLoState state)
            {
                float4 m0 = state.lo;
                float4 m1 = (state.lo * 3 + state.pivot) / 4;
                float4 m2 = (state.lo + state.pivot) / 2;
                float4 m3 = (state.lo + state.pivot * 3) / 4;
                float4 m4 = state.pivot;
                float4 m5 = (state.pivot * 3 + state.hi) / 4;
                float4 m6 = (state.pivot + state.hi) / 2;
                float4 m7 = (state.pivot + state.hi * 3) / 4;
                float4 m8 = state.hi;

                float4 stepsM0 = 0;
                float4 stepsM1 = 0;
                float4 stepsM2 = 0;
                float4 stepsM3 = 0;
                float4 stepsM4 = 0;
                float4 stepsM5 = 0;
                float4 stepsM6 = 0;
                float4 stepsM7 = 0;

                int sorLength = this.samplingOffsetsRle.Width;
                for (int sori = 0; sori < sorLength; ++sori)
                {
                    int3 dxdyLen = Hlsl.AsInt(this.samplingOffsetsRle[sori].XYZ);
                    float dy = dxdyLen.Y;
                    int dxEnd = dxdyLen.X + dxdyLen.Z;

                    for (int dx = dxdyLen.X; dx < dxEnd; ++dx)
                    {
                        float2 offset = new float2(dx, dy);
                        float4 sample = D2D.SampleInputAtOffset(0, offset);

                        stepsM0 += Hlsl.Step(sample, m0);
                        stepsM1 += Hlsl.Step(sample, m1);
                        stepsM2 += Hlsl.Step(sample, m2);
                        stepsM3 += Hlsl.Step(sample, m3);
                        stepsM4 += Hlsl.Step(sample, m4);
                        stepsM5 += Hlsl.Step(sample, m5);
                        stepsM6 += Hlsl.Step(sample, m6);
                        stepsM7 += Hlsl.Step(sample, m7);
                    }
                }

                bool4 isM01 = this.targetArea <= stepsM1;
                bool4 isM12 = this.targetArea <= stepsM2;
                bool4 isM23 = this.targetArea <= stepsM3;
                bool4 isM34 = this.targetArea <= stepsM4;
                bool4 isM45 = this.targetArea <= stepsM5;
                bool4 isM56 = this.targetArea <= stepsM6;
                bool4 isM67 = this.targetArea <= stepsM7;

                state.lo =
                    Hlsl.Select(isM01, m0,
                    Hlsl.Select(isM12, m1,
                    Hlsl.Select(isM23, m2,
                    Hlsl.Select(isM34, m3,
                    Hlsl.Select(isM45, m4,
                    Hlsl.Select(isM56, m5,
                    Hlsl.Select(isM67, m6,
                                       m7)))))));

                state.hi =
                    Hlsl.Select(isM01, m1,
                    Hlsl.Select(isM12, m2,
                    Hlsl.Select(isM23, m3,
                    Hlsl.Select(isM34, m4,
                    Hlsl.Select(isM45, m5,
                    Hlsl.Select(isM56, m6,
                    Hlsl.Select(isM67, m7,
                                       m8)))))));

                state.pivot = (state.lo + state.hi) / 2;
            }
#elif SEARCH_ARITY_16
            // 16-ary
            private void HiLo(ref HiLoState state)
            {
                float4 m0 = state.lo;
                float4 m1 = (state.lo * 7 + state.pivot * 1) / 8;
                float4 m2 = (state.lo * 6 + state.pivot * 2) / 8;
                float4 m3 = (state.lo * 5 + state.pivot * 3) / 8;
                float4 m4 = (state.lo * 4 + state.pivot * 4) / 8;
                float4 m5 = (state.lo * 3 + state.pivot * 5) / 8;
                float4 m6 = (state.lo * 2 + state.pivot * 6) / 8;
                float4 m7 = (state.lo * 1 + state.pivot * 7) / 8;
                float4 m8 = state.pivot;
                float4 m9 = (state.pivot * 7 + state.hi * 1) / 8;
                float4 mA = (state.pivot * 6 + state.hi * 2) / 8;
                float4 mB = (state.pivot * 5 + state.hi * 3) / 8;
                float4 mC = (state.pivot * 4 + state.hi * 4) / 8;
                float4 mD = (state.pivot * 3 + state.hi * 5) / 8;
                float4 mE = (state.pivot * 2 + state.hi * 6) / 8;
                float4 mF = (state.pivot * 1 + state.hi * 7) / 8;
                float4 mG = state.hi;

                float4 stepsM0 = 0;
                float4 stepsM1 = 0;
                float4 stepsM2 = 0;
                float4 stepsM3 = 0;
                float4 stepsM4 = 0;
                float4 stepsM5 = 0;
                float4 stepsM6 = 0;
                float4 stepsM7 = 0;
                float4 stepsM8 = 0;
                float4 stepsM9 = 0;
                float4 stepsMA = 0;
                float4 stepsMB = 0;
                float4 stepsMC = 0;
                float4 stepsMD = 0;
                float4 stepsME = 0;
                float4 stepsMF = 0;

                int sorLength = this.samplingOffsetsRle.Width;
                for (int sori = 0; sori < sorLength; ++sori)
                {
                    int3 dxdyLen = Hlsl.AsInt(this.samplingOffsetsRle[sori].XYZ);
                    float dy = dxdyLen.Y;
                    int dxEnd = dxdyLen.X + dxdyLen.Z;

                    for (int dx = dxdyLen.X; dx < dxEnd; ++dx)
                    {
                        float2 offset = new float2(dx, dy);
                        float4 sample = D2D.SampleInputAtOffset(0, offset);

                        stepsM0 += Hlsl.Step(sample, m0);
                        stepsM1 += Hlsl.Step(sample, m1);
                        stepsM2 += Hlsl.Step(sample, m2);
                        stepsM3 += Hlsl.Step(sample, m3);
                        stepsM4 += Hlsl.Step(sample, m4);
                        stepsM5 += Hlsl.Step(sample, m5);
                        stepsM6 += Hlsl.Step(sample, m6);
                        stepsM7 += Hlsl.Step(sample, m7);
                        stepsM8 += Hlsl.Step(sample, m8);
                        stepsM9 += Hlsl.Step(sample, m9);
                        stepsMA += Hlsl.Step(sample, mA);
                        stepsMB += Hlsl.Step(sample, mB);
                        stepsMC += Hlsl.Step(sample, mC);
                        stepsMD += Hlsl.Step(sample, mD);
                        stepsME += Hlsl.Step(sample, mE);
                        stepsMF += Hlsl.Step(sample, mF);
                    }
                }

                bool4 isM01 = this.targetArea <= stepsM1;
                bool4 isM12 = this.targetArea <= stepsM2;
                bool4 isM23 = this.targetArea <= stepsM3;
                bool4 isM34 = this.targetArea <= stepsM4;
                bool4 isM45 = this.targetArea <= stepsM5;
                bool4 isM56 = this.targetArea <= stepsM6;
                bool4 isM67 = this.targetArea <= stepsM7;
                bool4 isM78 = this.targetArea <= stepsM8;
                bool4 isM89 = this.targetArea <= stepsM9;
                bool4 isM9A = this.targetArea <= stepsMA;
                bool4 isMAB = this.targetArea <= stepsMB;
                bool4 isMBC = this.targetArea <= stepsMC;
                bool4 isMCD = this.targetArea <= stepsMD;
                bool4 isMDE = this.targetArea <= stepsME;
                bool4 isMEF = this.targetArea <= stepsMF;

                state.lo =
                    Hlsl.Select(isM01, m0,
                    Hlsl.Select(isM12, m1,
                    Hlsl.Select(isM23, m2,
                    Hlsl.Select(isM34, m3,
                    Hlsl.Select(isM45, m4,
                    Hlsl.Select(isM56, m5,
                    Hlsl.Select(isM67, m6,
                    Hlsl.Select(isM78, m7,
                    Hlsl.Select(isM89, m8,
                    Hlsl.Select(isM9A, m9,
                    Hlsl.Select(isMAB, mA,
                    Hlsl.Select(isMBC, mB,
                    Hlsl.Select(isMCD, mC,
                    Hlsl.Select(isMDE, mD,
                    Hlsl.Select(isMEF, mE,
                                       mF)))))))))))))));

                state.hi =
                    Hlsl.Select(isM01, m1,
                    Hlsl.Select(isM12, m2,
                    Hlsl.Select(isM23, m3,
                    Hlsl.Select(isM34, m4,
                    Hlsl.Select(isM45, m5,
                    Hlsl.Select(isM56, m6,
                    Hlsl.Select(isM67, m7,
                    Hlsl.Select(isM78, m8,
                    Hlsl.Select(isM89, m9,
                    Hlsl.Select(isM9A, mA,
                    Hlsl.Select(isMAB, mB,
                    Hlsl.Select(isMBC, mC,
                    Hlsl.Select(isMCD, mD,
                    Hlsl.Select(isMDE, mE,
                    Hlsl.Select(isMEF, mF,
                                       mG)))))))))))))));

                state.pivot = (state.lo + state.hi) / 2;
            }
#else
    #error Must #define SEARCH_ARITY_2, _4, _8, or _16
#endif
        }

        // Implementation of HiLoShader for when p=0 or p=1
        // It just returns the min or max value of the pixels within the kernel
        [D2DInputCount(1)]
        [D2DInputComplex(0)]
        [D2DInputDescription(0, D2D1Filter.MinMagMipPoint)]
        [D2DGeneratedPixelShaderDescriptor]
        [AutoConstructor]
        internal readonly partial struct HiLoP0orP1Shader
            : IPixelShader<HiLoP0orP1Shader>
        {
            static IPixelShaderTransformImpl IPixelShader<HiLoP0orP1Shader>.CreateTransform(in HiLoP0orP1Shader shader)
            {
                return new HiLoShaderTransform(Unsafe.BitCast<int4, RectInt32>(shader.samplingRectXYWH));
            }

            private readonly bool selectMinOrMax;
            private readonly int4 samplingRectXYWH;

            [AutoConstructorIgnore]
            [D2DResourceTextureIndex(1)]
            private readonly D2D1ResourceTexture1D<float4> samplingOffsetsRle;

            public float4 Execute()
            {
                float4 min = (float4)float.PositiveInfinity;
                float4 max = (float4)float.NegativeInfinity;

                int sorLength = this.samplingOffsetsRle.Width;
                for (int sori = 0; sori < sorLength; ++sori)
                {
                    int3 dxdyLen = Hlsl.AsInt(this.samplingOffsetsRle[sori].XYZ);
                    float dy = dxdyLen.Y;
                    int dxEnd = dxdyLen.X + dxdyLen.Z;

                    for (int dx = dxdyLen.X; dx < dxEnd; ++dx)
                    {
                        float2 offset = new float2(dx, dy);
                        float4 sample = D2D.SampleInputAtOffset(0, offset);
                        min = Hlsl.Min(min, sample);
                        max = Hlsl.Max(max, sample);
                    }
                }

                return this.selectMinOrMax ? min : max;
            }
        }

        private sealed class HiLoShaderTransform
            : RefTrackedObject,
              IPixelShaderTransformImpl
        {
            private readonly RectInt32 samplingRect;

            public HiLoShaderTransform(RectInt32 samplingRect)
            {
                this.samplingRect = samplingRect;
            }

            public void MapInputRectsToOutputRect(
                ReadOnlySpan<RectInt32> inputRects,
                ReadOnlySpan<RectInt32> inputOpaqueSubRects,
                out RectInt32 outputRect,
                out RectInt32 outputOpaqueSubRect)
            {
                MapInvalidRect(0, inputRects[0], out outputRect);
                outputOpaqueSubRect = default;
            }

            public void MapOutputRectToInputRects(RectInt32 outputRect, Span<RectInt32> inputRects)
            {
                for (int i = 0; i < inputRects.Length; ++i)
                {
                    MapInvalidRect(i, outputRect, out inputRects[i]);
                }
            }

            public void MapInvalidRect(int inputIndex, RectInt32 invalidInputRect, out RectInt32 invalidOutputRect)
            {
                switch (inputIndex)
                {
                    case 0:
                        RectInt64 rect0 = new RectInt64(
                            (long)invalidInputRect.X + this.samplingRect.Left,
                            (long)invalidInputRect.Y + this.samplingRect.Top,
                            (long)invalidInputRect.Width + this.samplingRect.Width,
                            (long)invalidInputRect.Height + this.samplingRect.Height);
                        RectInt64 rect1 = RectInt64.Intersect(rect0, RectInt32.LogicallyInfinite);
                        invalidOutputRect = (RectInt32)rect1;
                        break;

                    case 1:
                    case 2:
                        invalidOutputRect = invalidInputRect;
                        break;

                    default:
                        throw new IndexOutOfRangeException();
                }
            }
        }

        [D2DInputCount(1)]
        [D2DInputComplex(0)]
        [D2DInputDescription(0, D2D1Filter.MinMagMipPoint)]
        [D2DGeneratedPixelShaderDescriptor]
        [AutoConstructor]
        internal readonly partial struct MinHorizontalShader
            : IPixelShader<MinHorizontalShader>
        {
            static IPixelShaderTransformImpl IPixelShader<MinHorizontalShader>.CreateTransform(in MinHorizontalShader shader)
            {
                return new RadiusRectTransform(shader.radius, 0);
            }

            private readonly int radius;

            public float4 Execute()
            {
                float4 min = (float4)float.PositiveInfinity;

                for (int dx = -this.radius; dx <= this.radius; ++dx)
                {
                    min = Hlsl.Min(min, D2D.SampleInputAtOffset(0, new float2(dx, 0)));
                }

                return min;
            }
        }

        [D2DInputCount(1)]
        [D2DInputComplex(0)]
        [D2DInputDescription(0, D2D1Filter.MinMagMipPoint)]
        [D2DGeneratedPixelShaderDescriptor]
        [AutoConstructor]
        internal readonly partial struct MinVerticalShader
            : IPixelShader<MinVerticalShader>
        {
            static IPixelShaderTransformImpl IPixelShader<MinVerticalShader>.CreateTransform(in MinVerticalShader shader)
            {
                return new RadiusRectTransform(0, shader.radius);
            }

            private readonly int radius;

            public float4 Execute()
            {
                float4 min = (float4)float.PositiveInfinity;

                for (int dy = -this.radius; dy <= this.radius; ++dy)
                {
                    min = Hlsl.Min(min, D2D.SampleInputAtOffset(0, new float2(0, dy)));
                }

                return min;
            }
        }

        [D2DInputCount(1)]
        [D2DInputComplex(0)]
        [D2DInputDescription(0, D2D1Filter.MinMagMipPoint)]
        [D2DGeneratedPixelShaderDescriptor]
        [AutoConstructor]
        internal readonly partial struct MaxHorizontalShader
            : IPixelShader<MaxHorizontalShader>
        {
            static IPixelShaderTransformImpl IPixelShader<MaxHorizontalShader>.CreateTransform(in MaxHorizontalShader shader)
            {
                return new RadiusRectTransform(shader.radius, 0);
            }

            private readonly int radius;

            public float4 Execute()
            {
                float4 max = (float4)float.NegativeInfinity;

                for (int dx = -this.radius; dx <= this.radius; ++dx)
                {
                    max = Hlsl.Max(max, D2D.SampleInputAtOffset(0, new float2(dx, 0)));
                }

                return max;
            }
        }

        [D2DInputCount(1)]
        [D2DInputComplex(0)]
        [D2DInputDescription(0, D2D1Filter.MinMagMipPoint)]
        [D2DGeneratedPixelShaderDescriptor]
        [AutoConstructor]
        internal readonly partial struct MaxVerticalShader
            : IPixelShader<MaxVerticalShader>
        {
            static IPixelShaderTransformImpl IPixelShader<MaxVerticalShader>.CreateTransform(in MaxVerticalShader shader)
            {
                return new RadiusRectTransform(0, shader.radius);
            }

            private readonly int radius;

            public float4 Execute()
            {
                float4 max = (float4)float.NegativeInfinity;

                for (int dy = -this.radius; dy <= this.radius; ++dy)
                {
                    max = Hlsl.Max(max, D2D.SampleInputAtOffset(0, new float2(0, dy)));
                }

                return max;
            }
        }

        private sealed class RadiusRectTransform
            : RefTrackedObject,
              IPixelShaderTransformImpl
        {
            private readonly int radiusX;
            private readonly int radiusY;

            public RadiusRectTransform(int radiusX, int radiusY)
            {
                this.radiusX = radiusX;
                this.radiusY = radiusY;
            }

            public void MapInputRectsToOutputRect(
                ReadOnlySpan<RectInt32> inputRects,
                ReadOnlySpan<RectInt32> inputOpaqueSubRects,
                out RectInt32 outputRect,
                out RectInt32 outputOpaqueSubRect)
            {
                MapInvalidRect(0, inputRects[0], out outputRect);
                outputOpaqueSubRect = default;
            }

            public void MapInvalidRect(int inputIndex, RectInt32 invalidInputRect, out RectInt32 invalidOutputRect)
            {
                invalidOutputRect = InflateHelper(invalidInputRect, this.radiusX, this.radiusY);
            }

            public void MapOutputRectToInputRects(RectInt32 outputRect, Span<RectInt32> inputRects)
            {
                inputRects[0] = InflateHelper(outputRect, this.radiusX, this.radiusY);
            }

            private static RectInt32 InflateHelper(RectInt32 rect, int radiusX, int radiusY)
            {
                // First, do calculations at 64-bit to avoid overflow
                long left = rect.Left - radiusX;
                long top = rect.Top - radiusY;
                long right = rect.Right + radiusX;
                long bottom = rect.Bottom + radiusY;

                // Create a 64-bit rectangle
                RectInt64 result64 = RectInt64.FromEdges(left, top, right, bottom);

                // Clamp (intersect) the rectangle to the 32-bit "logically infinite" area, then cast to a 32-bit rectangle
                RectInt32 result = (RectInt32)RectInt64.Intersect(result64, RectInt32.LogicallyInfinite);

                return result;
            }
        }
    }
}
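
For readers following along: the MinHorizontalShader / MinVerticalShader pair above (and the Max pair) relies on the fact that a min or max over a square window can be split into a horizontal pass followed by a vertical pass. Here is a minimal CPU-side sketch of that idea, not the plugin's code. The class and method names are made up for illustration, and the edge clamping is an assumption (the shaders leave border handling to their input).

```csharp
using System;

// Hypothetical CPU reference, for illustration only.
public static class SeparableMinSketch
{
    // Minimum over a (2*rx+1) x (2*ry+1) window around each pixel.
    // Out-of-range coordinates are clamped to the nearest edge (an assumption).
    public static float[,] WindowMin(float[,] src, int rx, int ry)
    {
        int h = src.GetLength(0), w = src.GetLength(1);
        var dst = new float[h, w];
        for (int y = 0; y < h; ++y)
        {
            for (int x = 0; x < w; ++x)
            {
                float min = float.PositiveInfinity;
                for (int dy = -ry; dy <= ry; ++dy)
                {
                    for (int dx = -rx; dx <= rx; ++dx)
                    {
                        int sy = Math.Clamp(y + dy, 0, h - 1);
                        int sx = Math.Clamp(x + dx, 0, w - 1);
                        min = Math.Min(min, src[sy, sx]);
                    }
                }
                dst[y, x] = min;
            }
        }
        return dst;
    }

    // Two 1D passes: horizontal, then vertical. 2 * (2r+1) samples per pixel.
    public static float[,] SeparableMin(float[,] src, int r)
        => WindowMin(WindowMin(src, r, 0), 0, r);

    // One 2D pass over the full square. (2r+1)^2 samples per pixel.
    public static float[,] SquareMin(float[,] src, int r)
        => WindowMin(src, r, r);
}
```

Both methods return the same image, but for radius 100 the separable version reads 402 samples per pixel instead of 40,401. That only works for a square window, which is presumably why the min/max pre-pass uses a square rather than the circular kernel.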

Rick Brewster · February 25

On 2/17/2024 at 9:50 AM, _koh_ said:

And histogram is one thing we should do in the storage format, which could be sRGB or anything.

I experimented with converting to/from linear space (e.g. WorkingSpaceLinear) -- and the results were substantially worse than with WorkingSpace. This is definitely an algorithm that should execute "within" the original color space.

_koh_ · February 26

8 hours ago, Rick Brewster said:

Instead of starting at a pivot point of c=0.5, I use the output of shaders that calculate the min and max for the neighborhood square kernel area. This then establishes the traditional lo, hi, and pivot values for the binary search.

That's neat. I never thought about pre-computing a better pivot and starting from it.
I haven't gone through the code so I might be wrong, but basically when the min-max range of the samples is < 1/2 we only need 7 tests instead of 8, when it's < 1/4 we only need 6, and so on. And we don't need the exact min-max to narrow down the range, so you are doing square sampling instead of circular sampling to get the V->H optimization. Something like that, I guess.
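
To make the "7 tests instead of 8" arithmetic concrete, here is a rough sketch of how many bisection steps are needed to reach 8-bit resolution from a given starting range. It is only an illustration of the idea: the function name is hypothetical and nothing like this exists in the posted shaders.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class IterationSketch
{
    // Number of halvings needed before the [lo, hi] range is no wider
    // than one 8-bit step (1/256).
    public static int IterationsFor8Bit(float lo, float hi)
    {
        float range = hi - lo;
        int count = 0;
        while (range > 1f / 256f)
        {
            range *= 0.5f;   // one binary-search step halves the range
            ++count;
        }
        return count;
    }
}
```

With a full [0, 1] range this gives 8, with a range of 1/2 it gives 7, and with 1/4 it gives 6.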

8 hours ago, Rick Brewster said:

Binary search provides 1-bit per iteration. I also implemented 4-ary, 8-ary, and 16-ary. I kept only 4-ary enabled because it has the best mix of performance and can reach 8-bits of output in 4 iterations (instead of 8 iterations w/ binary search).

Yeah, we can cut the sampling passes to 1/n with 2^(n-1) registers. I considered it but never tested it.
2x registers means the GPU can keep only 1/2 as many threads in flight, so we get 1/2, 1/3, 1/4 sampling with 1/2, 1/4, 1/8 threads. Going beyond n=2 is unlikely to be worth it, but it's quite possible that n=2 is better than n=1 in general.
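
A CPU-side sketch of the k-ary search being discussed, to show the trade-off: each pass over the samples counts how many fall at or below each of the arity - 1 interior thresholds, then keeps the sub-interval that contains the target rank. This mirrors the Step / Select structure of the shader code above but is not that code. The names are hypothetical, and it assumes samples in [0, 1] and a power-of-two arity.

```csharp
using System;

// Hypothetical CPU reference, for illustration only.
public static class KArySearchSketch
{
    // Returns the (lo, hi] interval that contains the target-th smallest sample
    // after the given number of passes. target is a 1-based rank.
    public static (float lo, float hi) Search(float[] samples, int target, int arity, int iterations)
    {
        float lo = 0f, hi = 1f;
        var counts = new int[arity];
        for (int it = 0; it < iterations; ++it)
        {
            Array.Clear(counts, 0, counts.Length);

            // One pass over the samples, testing every interior threshold.
            foreach (float s in samples)
            {
                for (int i = 1; i < arity; ++i)
                {
                    float threshold = lo + (hi - lo) * i / arity;
                    if (s <= threshold) counts[i]++;
                }
            }

            // Keep the first sub-interval whose upper threshold reaches the target rank.
            int k = arity - 1;
            for (int i = 1; i < arity; ++i)
            {
                if (target <= counts[i]) { k = i - 1; break; }
            }

            float newLo = lo + (hi - lo) * k / arity;
            float newHi = lo + (hi - lo) * (k + 1) / arity;
            lo = newLo;
            hi = newHi;
        }
        return (lo, hi);
    }
}
```

Binary search with 8 passes, 4-ary with 4 passes, and 16-ary with 2 passes all land on the same interval of width 1/256. What changes is how many counters have to stay live during each pass, which is the register pressure mentioned above.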

8 hours ago, Rick Brewster said:

This would enable the effect to run without monopolizing the GPU as much and would help to avoid causing major UI lag.

I'm just curious, but is making it multi-pass better than using more tiles? PDN is already doing tile rendering.

8 hours ago, Rick Brewster said:

This is definitely an algorithm that should execute "within" the original color space.

Yeah, while a linear color space gives us more accurate results in general, doing histograms, dithering, etc. in the original color space makes more sense.

Rick Brewster · February 26

11 hours ago, _koh_ said:

I'm just curious, but is making it multi-pass better than using more tiles? PDN is already doing tile rendering.

It may also be worth having PDN use smaller tiles in this case. I'm not sure whether it should be an option specified in OnInitializeRenderInfo(), or if PDN should somehow auto-detect that the effect is running "too slow" and automatically adjust downwards.

I think both should be used in this case. Using either of the two (multiple rendering passes, or smaller tiles) will help a lot, but lower-end hardware will really need both.

Here's how the tile size is calculated, based on the total image size:

Rick Brewster · February 26

12 hours ago, _koh_ said:

I haven't gone through the code so I might be wrong, but basically when the min-max range of the samples is < 1/2 we only need 7 tests instead of 8, when it's < 1/4 we only need 6, and so on.

The number of iterations is currently fixed, but that's an interesting idea.

Rick Brewster · February 26

On 2/25/2024 at 1:44 PM, Rick Brewster said:

For my performance testing, I used an ~~18K x 12K~~ 12K x 8K image. I set radius to 100, percentile to 75, and then used either "Full" sampling (w/ your original shader), or the default iteration count (for my shaders). Your original shader took 30.7 seconds, while my 4-ary implementation takes 17.8 seconds (with higher quality!).

I was able to convert this to a compute shader that calculates 2 pixels at a time: 10.5 seconds 😁

Increasing that to 4 pixels reduced performance, likely because of occupancy spillage.

Rick Brewster · February 26

(Correction to data above: I've been using a 12K x 8K image for performance testing, not 18K x 12K)

Rick Brewster · February 27

4 hours ago, Rick Brewster said:

I was able to convert this to a compute shader that calculates 2 pixels at a time: 10.5 seconds 😁

I got it down to 9.5s by calculating 3 px at a time 😎

_koh_ · February 27

Seems like you can get at least 4x compute per fetch compared to the original shader for free.
What if you change to arity=2 and keep the 4-pixel output? That's another 4x compute / fetch setup, I believe.

_koh_ · February 28

My last post sounds like a mess, even for my English. haha
What I was trying to say: it looks like the original shader is idling 75% of the time because of bandwidth or latency, so doing 4x the computation per sample and cutting the loop iterations to 1/4 might be the sweet spot.
I expect cutting any part of the for(y) for(x) for(i) loop to 1/4 has the same effect, but a 1/4 i loop requires 8x the computation per sample, and a pixel shader can't reconfigure the xy loop.
I tested the 1/2 i loop configuration and got a 15% boost with it. My latest version does INT8 sampling, so apparently there isn't that much room left.

edit:

I found the performance ceiling on my GPU is closer to 3x than 4x, so it's likely GPU clock dependent.
When a 2GHz GPU is idling 75% of the time, a 1.5GHz GPU is idling 66% of the time, and so on.
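
The 75% vs 66% figures follow from a very simple model, sketched below. The assumption (mine, not measured) is that the time spent waiting on fetches is fixed, while the time spent on ALU work scales inversely with the core clock. The names are hypothetical.

```csharp
using System;

// Hypothetical back-of-the-envelope model, for illustration only.
public static class IdleModelSketch
{
    // Assumption: fetch time per sample is fixed, ALU time scales as 1 / clock.
    public static double IdleFraction(double idleAtRef, double refClockGHz, double clockGHz)
    {
        double busyAtRef = 1.0 - idleAtRef;
        double busy = busyAtRef * refClockGHz / clockGHz;
        return 1.0 - busy;
    }

    // How much extra compute per sample fits before the ALU becomes the bottleneck.
    public static double ComputeHeadroom(double idleFraction)
        => 1.0 / (1.0 - idleFraction);
}
```

Under that model, a GPU that idles 75% of the time at 2GHz idles about 66.7% of the time at 1.5GHz, and the headroom drops from 4x to 3x.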

Edited March 1 by _koh_

_koh_ · March 8

Posting the latest version with 2-bit binary search, or more like a quarter search.
Source code + DLL: MedianFilterGPU.zip

Additionally, I made versions which compute 2, 3, 4 pixel colors at once, then ran them on 1/2, 1/3, 1/4 sized images to estimate compute shader performance.

8K image, radius 100, sampling rate 1/4, RTX 3060 laptop

No optimization - 18.2s
INT8 sampling - 8.6s
2bit binary search - 10.2s
pseudo 2,3,4 pixel output - 10.2s, 8.2s, 7.8s
INT8 sampling + 2bit binary search - 7.2s
INT8 sampling + pseudo 2,3,4 pixel output - 6.9s

Looks like 2.6x the original version is the performance ceiling on my GPU, and this latest version is at 2.5x.
The 2-bit binary search needs to test 3 thresholds inside the loop to cut the loop iterations to 1/2. Maybe that's why it runs slightly slower.
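
For reference, the 2.5x and 2.6x figures are just the timings listed above divided into the unoptimized 18.2s. A small sketch of that arithmetic, with a hypothetical helper name:

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class SpeedupSketch
{
    // Speedup of a variant relative to the unoptimized baseline time.
    public static double Speedup(double baselineSeconds, double variantSeconds)
        => baselineSeconds / variantSeconds;

    public static void Main()
    {
        const double baseline = 18.2; // "No optimization" timing from the list above

        Console.WriteLine($"INT8 sampling:                      {Speedup(baseline, 8.6):F2}x");
        Console.WriteLine($"2bit binary search:                 {Speedup(baseline, 10.2):F2}x");
        Console.WriteLine($"INT8 + 2bit binary search:          {Speedup(baseline, 7.2):F2}x");
        Console.WriteLine($"INT8 + pseudo 2,3,4 pixel output:   {Speedup(baseline, 6.9):F2}x");
    }
}
```

18.2 / 7.2 is about 2.5 and 18.2 / 6.9 is about 2.6, which is where the two numbers come from.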

2bit binary search shader

private readonly partial struct Render : ID2D1PixelShader {
    private readonly float r, p;
    private readonly float3 d;

    private float4 HiLo(float4 c, float v) {
        float3x4 n = 0;
        float m = 0;
        float y = r % d.Y - r;
        for (; y <= r; y += d.Y) {
            float w = Hlsl.Trunc(Hlsl.Sqrt(r * r - y * y));
            float x = (w + r * d.X + y / d.Y * d.Z) % d.X - w;
            for (; x <= w; x += d.X) {
                float4 s = D2D.SampleInputAtOffset(0, new(x, y));
                n += Hlsl.Step(new float3x4(s, s, s), new(c - v, c, c + v));
                m += 1;
            }
        }
        return (float3)1 * (1 - 2 * Hlsl.Step(Hlsl.Max(m * p, 1), n * 100));
    }

    public float4 Execute() {
        float4 c = 0.5f;
        float  v = 0.5f;
        c += HiLo(c, v *= 0.5f) * (v *= 0.5f);
        c += HiLo(c, v *= 0.5f) * (v *= 0.5f);
        c += HiLo(c, v *= 0.5f) * (v *= 0.5f);
        c += HiLo(c, v *= 0.5f) * (v *= 0.5f);
        return c;
    }
}

pseudo 4 pixel output shader

private readonly partial struct Render : ID2D1PixelShader {
    private readonly float r, p;
    private readonly float3 d;

    private float4x4 HiLo(float4x4 c, float2 o) {
        float4x4 n = 0;
        float m = 0;
        float y = r % d.Y - r;
        for (; y <= r; y += d.Y) {
            float w = Hlsl.Trunc(Hlsl.Sqrt(r * r - y * y));
            float x = (w + r * d.X + y / d.Y * d.Z) % d.X - w;
            for (; x - d.X * 3 <= w; x += d.X) {
                // float4 s = input[(int2)(o + new float2(x, y))];
                float4 s = D2D.SampleInputAtPosition(0, o + new float2(x, y));
                float4 a = Hlsl.Step(Hlsl.Abs(x - d.X * new float4(0, 1, 2, 3)), w);
                n += new float4x4(a.X, a.Y, a.Z, a.W) * Hlsl.Step(new(s, s, s, s), c);
                m += a.X;
            }
        }
        return 1 - 2 * Hlsl.Step(Hlsl.Max(m * p, 1), n * 100);
    }

    public float4 Execute() {
        // float2 o = new(ThreadIds.X * 4 - ThreadIds.X % d.X * 3, ThreadIds.Y);
        float2 o = D2D.GetScenePosition().XY;
        float4x4 c = 0.5f;
        float v = 0.5f;
        c += HiLo(c, o) * (v *= 0.5f);
        c += HiLo(c, o) * (v *= 0.5f);
        c += HiLo(c, o) * (v *= 0.5f);
        c += HiLo(c, o) * (v *= 0.5f);
        c += HiLo(c, o) * (v *= 0.5f);
        c += HiLo(c, o) * (v *= 0.5f);
        c += HiLo(c, o) * (v *= 0.5f);
        c += HiLo(c, o) * (v *= 0.5f);
        // output[(int2)(o + new float2(d.X * 0, 0))] = c[0];
        // output[(int2)(o + new float2(d.X * 1, 0))] = c[1];
        // output[(int2)(o + new float2(d.X * 2, 0))] = c[2];
        // output[(int2)(o + new float2(d.X * 3, 0))] = c[3];
        return (float4)1 / 4 * c;
    }
}
