Rick Brewster

Administrator
Posts posted by Rick Brewster

  1. On 2/25/2024 at 1:44 PM, Rick Brewster said:

    For my performance testing, I used a 12K x 8K image. I set radius to 100, percentile to 75, and then used either "Full" sampling (w/ your original shader), or the default iteration count (for my shaders). Your original shader took 30.7 seconds, while my 4-ary implementation takes 17.8 seconds (with higher quality!).

     

    I was able to convert this to a compute shader that calculates 2 pixels at a time: 10.5 seconds 😁 

     

    Increasing that to 4 pixels reduced performance, likely because of occupancy spillage.

  2. 12 hours ago, _koh_ said:

    I haven't gone through the code so I might be wrong, but basically when the range (max minus min) of the samples is < 1/2, we only need 7 tests instead of 8; when it is < 1/4, we only need 6, and so on.

    The number of iterations is currently fixed, but that's an interesting idea.
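The quoted idea can be sketched on the CPU as follows. This is only an illustration of the arithmetic, not code from the effect: `IterationEstimate`, `IterationsNeeded`, and its parameters are hypothetical names, and it assumes sample values normalized to [0, 1] with a target of 8 bits of output precision.

```csharp
using System;

public static class IterationEstimate
{
    // Returns how many search iterations are needed to narrow an interval of
    // width `range` down to 1 / 2^targetBits, given that each iteration
    // resolves `bitsPerIteration` bits (1 for binary, 2 for 4-ary, and so on).
    // Assumption: values are normalized so the full dynamic range is 1.0.
    public static int IterationsNeeded(float range, int targetBits = 8, int bitsPerIteration = 1)
    {
        if (targetBits < 1 || bitsPerIteration < 1)
        {
            throw new ArgumentOutOfRangeException();
        }

        float step = 1.0f / (1 << targetBits);
        int bits = 0;

        // Each halving of the interval corresponds to one bit of precision.
        while (range > step && bits < targetBits)
        {
            range *= 0.5f;
            bits++;
        }

        // Round up to a whole number of iterations.
        return (bits + bitsPerIteration - 1) / bitsPerIteration;
    }
}
```

Under these assumptions, a neighborhood range of 0.4 needs 7 binary tests and a range of 0.2 needs 6, matching the counts in the quote. With 2 bits per iteration (4-ary), those become 4 and 3 iterations.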

  3. 11 hours ago, _koh_ said:

    I'm just curious, but is making it multi-pass better than using more tiles? PDN is already doing tile rendering.

    It may also be worth having PDN use smaller tiles in this case. I'm not sure whether it should be an option specified in OnInitializeRenderInfo(), or if PDN should somehow auto-detect that the effect is running "too slow" and automatically adjust downwards.

     

    I think both should be used in this case. Using either of the two (multiple rendering passes, or smaller tiles) will help a lot, but lower-end hardware will really need both.

     

    Here's how the tile size is calculated, based on the total image size:

     

    [Image: image.png]

  4. On 2/17/2024 at 9:50 AM, _koh_ said:

    And histogram is one thing we should do in the storage format, which could be sRGB or anything.

     

    I experimented with converting to/from linear space (e.g. WorkingSpaceLinear) -- and the results were substantially worse than with WorkingSpace. This is definitely an algorithm that should execute "within" the original color space.

  5. I've been able to optimize this further vs. the original shader (at full sampling) in this post, cutting down the execution time by about 42% -- without using a compute shader (although that's my next step!) and while also improving quality.

     

    Here's how I did it:

    1. Instead of starting at a pivot point of c=0.5, I use the output of shaders that calculate the min and max for the neighborhood square kernel area. This then establishes the traditional lo, hi, and pivot values for the binary search. This is less precise than taking the min/max of the circle kernel area, but can execute substantially faster (linear instead of quadratic) because these are separable kernels. This has two side effects: 1) it increases precision in areas that have a smaller dynamic range, and 2) it supports any dynamic range of values, not just those that are [0,1].
    2. Binary search provides 1 bit per iteration. I also implemented 4-ary, 8-ary, and 16-ary searches. I kept only 4-ary enabled because it has the best performance and can reach 8 bits of output in 4 iterations (instead of 8 iterations w/ binary search). The 8-ary search can hit 9 bits in 3 iterations, which is more than we need. The 16-ary search can hit 8 bits in 2 iterations, but because it uses so many registers it actually runs slower due to reduced shader occupancy.
    3. The search now produces the wrong result when percentile=0 because it can only output the value from the localized min shader, which often provides the min value of a pixel outside of the circular kernel. This means you get "squares" instead of "circles" in the output. I special-case this to use a different shader that finds the minimum value within the circular kernel. It's possible to incorporate this logic into the regular n-ary shader methods, but it significantly reduces performance.
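The first two steps above can be sketched on the CPU like this. It is a minimal illustration on a flat array of single-channel samples, not the shader itself: `PercentileSearch` and `Approximate` are hypothetical names, the lo/hi bounds come from the samples directly rather than from separate min/max passes, and the real shader works on `float4` values sampled from a circular kernel.

```csharp
using System;

public static class PercentileSearch
{
    // Counts samples at or below the threshold. This plays the role of
    // summing step(sample, threshold) over the kernel in the shader.
    private static int CountAtOrBelow(float[] samples, float threshold)
    {
        int count = 0;
        foreach (float s in samples)
        {
            if (s <= threshold) count++;
        }
        return count;
    }

    // Approximates the given percentile (0..1) with an n-ary search.
    // Each iteration splits [lo, hi] into `arity` equal sub-intervals and
    // keeps the first one whose upper bound covers the target count, so it
    // resolves log2(arity) bits per iteration.
    public static float Approximate(float[] samples, float percentile, int arity, int iterations)
    {
        if (samples.Length == 0) throw new ArgumentException("No samples.");
        if (arity < 2) throw new ArgumentOutOfRangeException(nameof(arity));

        // Start from the neighborhood's min and max instead of [0, 1].
        float lo = float.MaxValue;
        float hi = float.MinValue;
        foreach (float s in samples)
        {
            lo = Math.Min(lo, s);
            hi = Math.Max(hi, s);
        }

        float targetCount = samples.Length * percentile;

        for (int i = 0; i < iterations; i++)
        {
            float width = (hi - lo) / arity;
            int chosen = arity - 1; // fall through to the last sub-interval

            for (int j = 1; j < arity; j++)
            {
                if (targetCount <= CountAtOrBelow(samples, lo + width * j))
                {
                    chosen = j - 1;
                    break;
                }
            }

            float newLo = lo + width * chosen;
            lo = newLo;
            hi = newLo + width;
        }

        // The pivot of the final interval is the approximation.
        return (lo + hi) / 2;
    }
}
```

With 101 evenly spaced samples from 0 to 1, `Approximate(samples, 0.5f, 4, 4)` and `Approximate(samples, 0.5f, 2, 8)` both narrow the interval to a width of 1/256 around 0.5. The 4-ary version gets there in half as many passes over the samples, at the cost of more comparisons per sample.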

    For my performance testing, I used a 12K x 8K image. I set radius to 100, percentile to 75, and then used either "Full" sampling (w/ your original shader), or the default iteration count (for my shaders). Your original shader took 30.7 seconds, while my 4-ary implementation takes 17.8 seconds (with higher quality!).

     

    The next steps for optimization would seem to be using a compute shader, which could calculate multiple output pixels at once. This should be able to bring that 17.8 down even further, meaning this might even be shippable as a built-in PDN effect! And a quality slider that chooses full vs. half vs. etc. sampling would also enable faster performance (like your shader does).

     

    I'd also like to separate each iteration of the algorithm into its own rendering pass. This would definitely require a compute shader, as it would need to write out 2 additional float4s in order to provide the hi/lo markers (so the output image would be 3x the width of the input image, and then a final shader would discard those 2 extra values). This would enable the effect to run without monopolizing the GPU as much and would help to avoid causing major UI lag. I don't think it would improve performance, but I need to see how it goes.
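One way to picture the 3x-wide intermediate image is the index mapping below. This is purely an assumption for illustration: the post does not say how the three values would be arranged, so the interleaved lo/pivot/hi order and the names `SearchStateLayout`, `Slot`, `StateWidth`, `StateColumn`, and `ImageColumn` are all hypothetical.

```csharp
using System;

public static class SearchStateLayout
{
    // Assumed order of the three float4 values stored per input pixel.
    public enum Slot
    {
        Lo = 0,
        Pivot = 1,
        Hi = 2
    }

    public const int SlotsPerPixel = 3;

    // Width of the intermediate image that carries the search state.
    public static int StateWidth(int imageWidth)
    {
        return imageWidth * SlotsPerPixel;
    }

    // Column in the intermediate image that holds the given slot for input column x.
    public static int StateColumn(int x, Slot slot)
    {
        return x * SlotsPerPixel + (int)slot;
    }

    // Input column that a column of the intermediate image belongs to.
    public static int ImageColumn(int stateColumn)
    {
        return stateColumn / SlotsPerPixel;
    }
}
```

Under this layout, each search pass would read and write all three slots, and the final pass would read only `StateColumn(x, Slot.Pivot)` to produce the output pixel, discarding the other two values.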

     

    Here's the code I've got so far. It's using some PDN internal stuff (like PixelShaderEffect<T>), but you can still translate it to not use the internal stuff.

     

    Spoiler

     

    //#define SEARCH_ARITY_2
    #define SEARCH_ARITY_4
    //#define SEARCH_ARITY_8
    //#define SEARCH_ARITY_16
    
    using ComputeSharp;
    using ComputeSharp.D2D1;
    using PaintDotNet.Collections;
    using PaintDotNet.ComponentModel;
    using PaintDotNet.Direct2D1;
    using PaintDotNet.Direct2D1.Effects;
    using PaintDotNet.Rendering;
    using System;
    using System.Diagnostics;
    using System.Runtime.CompilerServices;
    
    namespace PaintDotNet.Effects.Gpu;
    
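    // Approximates a percentile filter (median at percentile 0.5) over a circular kernel:
    // 1. Separable min/max shaders compute lo/hi bounds over the square neighborhood.
    // 2. HiLoShader runs an n-ary search between lo and hi, counting the samples
    //    at or below each candidate threshold on every iteration.
    // 3. Percentile 0 and 1 are special-cased by HiLoP0orP1Shader.
    // TODO: doc comments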
    public sealed partial class PdnMedianEffect
        : CustomEffect<PdnMedianEffect.Props>
    {
        internal const int MaxRadius = 100;
    
        // Allow slider to go up to "12 bits" worth of precision. We actually get more effective bits due to
        // other particulars of the algorithm, but this is a good stopping point that allows them all to line up.
        // 2-ary search generates 1 bit per iteration, 4-ary search generates 2 bits, 8-ary search generates
        // 3 bits, and 16-ary search generates 4-bits.
        internal const int MaxIterations =
    #if SEARCH_ARITY_2
            12
    #elif SEARCH_ARITY_4
            6
    #elif SEARCH_ARITY_8
            4
    #elif SEARCH_ARITY_16
            3
    #endif
            ;
    
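        // The default iteration count targets 8 bits of precision (9 bits for arity 8).
        // The division rounds up so that arity 8 (MaxIterations = 4) yields 3 rather than 2.
        internal const int DefaultIterations = ((MaxIterations * 2) + 2) / 3; // 8, 4, 3, 2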
    
        public PdnMedianEffect(IDeviceEffectFactory factory)
            : base(factory)
        {
        }
    
        public sealed class Props
            : CustomEffectProperties
        {
            protected override CustomEffectImpl CreateImpl()
            {
                return new Impl();
            }
    
            public EffectInputAccessor Input => CreateInputAccessor(0);
    
            /// <summary>
            /// The radius of the effect. A value of 0 will disable the effect.<br/>
            /// Performance scales with the square of this property value. Doubling the radius will quadruple the rendering time.<br/>
            /// The valid range is [0, 100], the default is 25.
            /// </summary>
            public EffectPropertyAccessor<float> Radius => CreateFloatPropertyAccessor(0, radiusSpec);
            private static readonly EffectPropertyValueSpec radiusSpec = new EffectPropertyValueSpec(25.0f, 0.0f, PdnMedianEffect.MaxRadius);
    
            /// <summary>
            /// Specifies the percentile to use when approximating the median. Lower values result in darkened/eroded results,
            /// while higher values result in brightened/dilated results.<br/>
            /// The range is [0,1] which corresponds to the UI's range of [0, 100]. The default value is 0.5f.
            /// </summary>
            public EffectPropertyAccessor<float> Percentile => CreateFloatPropertyAccessor(1, percentileSpec);
            private static readonly EffectPropertyValueSpec percentileSpec = new EffectPropertyValueSpec(0.5f, 0.0f, 1.0f);
    
            // TODO: doc comment, value spec
            public EffectPropertyAccessor<int> Iterations => CreateInt32PropertyAccessor(2); // [1, MaxIterations]
    
            /// <summary>
            /// Specifies how sampling beyond the edge of the image should be performed.<br/>
            /// The default value is <see cref="BorderEdgeMode2.Clamp"/>.
            /// </summary>
            public EffectPropertyAccessor<BorderEdgeMode2> EdgeMode => CreateEnumPropertyAccessor<BorderEdgeMode2>(3, edgeModeSpec);
            private static readonly EffectPropertyValueSpec edgeModeSpec = new EffectPropertyValueSpec(BorderEdgeMode2.Clamp, null, null);
    
            /// <summary>
            /// Specifies the alpha mode for the input and output.<br/>
            /// The default value is <see cref="AlphaMode.Premultiplied"/>.
            /// </summary>
            public EffectPropertyAccessor<AlphaMode> AlphaMode => CreateEnumPropertyAccessor<AlphaMode>(4, alphaModeSpec);
            private static readonly EffectPropertyValueSpec alphaModeSpec = new EffectPropertyValueSpec(PaintDotNet.Direct2D1.AlphaMode.Premultiplied, null, null);
        }
    
        internal sealed partial class Impl
            : CustomEffectImpl<Props>
        {
            private EffectTransform<ConvertAlphaEffect>? convertInputAlpha;
            private EffectTransform<BorderEffect2>? border;
            private EffectTransform<PixelShaderEffect<MinHorizontalShader>>? minValueH;
            private EffectTransform<PixelShaderEffect<MinVerticalShader>>? minValueV;
            private EffectTransform<PixelShaderEffect<MaxHorizontalShader>>? maxValueH;
            private EffectTransform<PixelShaderEffect<MaxVerticalShader>>? maxValueV;
            private EffectTransform<PixelShaderEffect<HiLoShader>>? hiLoShader;
            private EffectTransform<PixelShaderEffect<HiLoP0orP1Shader>>? hiLoP0orP1Shader;
            private EffectTransform<ConvertAlphaEffect>? convertOutputAlpha;
    
            public Impl()
            {
            }
    
            protected override void Dispose(bool disposing)
            {
                DisposableUtil.Free(ref this.convertInputAlpha);
                DisposableUtil.Free(ref this.border);
                DisposableUtil.Free(ref this.minValueH);
                DisposableUtil.Free(ref this.minValueV);
                DisposableUtil.Free(ref this.maxValueH);
                DisposableUtil.Free(ref this.maxValueV);
                DisposableUtil.Free(ref this.hiLoShader);
                DisposableUtil.Free(ref this.hiLoP0orP1Shader);
                DisposableUtil.Free(ref this.convertOutputAlpha);
                base.Dispose(disposing);
            }
    
            protected override void OnInitialize()
            {
                this.Properties.Radius.SetValue(25); // matches the documented default in radiusSpec
                this.Properties.Percentile.SetValue(0.5f);
                this.Properties.Iterations.SetValue(DefaultIterations);
                this.Properties.EdgeMode.SetValue(BorderEdgeMode2.Clamp);
                this.Properties.AlphaMode.SetValue(AlphaMode.Premultiplied);
    
                this.convertInputAlpha = this.TransformGraph.AddNode(new ConvertAlphaEffect(this.EffectContext));
                this.border = this.TransformGraph.AddNode(new BorderEffect2(this.EffectContext));
                this.minValueH = this.TransformGraph.AddNode(new PixelShaderEffect<MinHorizontalShader>(this.EffectContext));
                this.minValueV = this.TransformGraph.AddNode(new PixelShaderEffect<MinVerticalShader>(this.EffectContext));
                this.maxValueH = this.TransformGraph.AddNode(new PixelShaderEffect<MaxHorizontalShader>(this.EffectContext));
                this.maxValueV = this.TransformGraph.AddNode(new PixelShaderEffect<MaxVerticalShader>(this.EffectContext));
                this.hiLoShader = this.TransformGraph.AddNode(new PixelShaderEffect<HiLoShader>(this.EffectContext));
                this.hiLoP0orP1Shader = this.TransformGraph.AddNode(new PixelShaderEffect<HiLoP0orP1Shader>(this.EffectContext));
                this.convertOutputAlpha = this.TransformGraph.AddNode(new ConvertAlphaEffect(this.EffectContext));
    
                this.TransformGraph.ConnectToEffectInput(0, this.convertInputAlpha, 0);
                this.TransformGraph.ConnectNode(this.convertInputAlpha, this.border, 0);
                this.TransformGraph.ConnectNode(this.border, this.minValueH, 0);
                this.TransformGraph.ConnectNode(this.border, this.maxValueH, 0);
                this.TransformGraph.ConnectNode(this.minValueH, this.minValueV, 0);
                this.TransformGraph.ConnectNode(this.maxValueH, this.maxValueV, 0);
                this.TransformGraph.ConnectNode(this.border, this.hiLoShader, 0);
                this.TransformGraph.ConnectNode(this.minValueV, this.hiLoShader, 1);
                this.TransformGraph.ConnectNode(this.maxValueV, this.hiLoShader, 2);
                this.TransformGraph.ConnectNode(this.border, this.hiLoP0orP1Shader, 0);
    
                base.OnInitialize();
            }
    
            protected override void OnPrepareForRender(ChangeType changeType)
            {
                float radius = Math.Clamp(this.Properties.Radius.GetValue(), 0, MaxRadius);
                float percentile = Math.Clamp(this.Properties.Percentile.GetValue(), 0.0f, 1.0f);
                int iterations = Math.Clamp(this.Properties.Iterations.GetValue(), 1, MaxIterations);
                BorderEdgeMode2 edgeMode = this.Properties.EdgeMode.GetValue();
                AlphaMode alphaMode = this.Properties.AlphaMode.GetValue();
    
                if (radius <= 0)
                {
                    this.TransformGraph.SetPassthroughGraph(0);
                }
                else
                {
                    this.convertInputAlpha!.Effect.Properties.Mode.SetValue(
                        alphaMode == AlphaMode.Straight ? ConvertAlphaMode.Passthrough : ConvertAlphaMode.UnPremultiply);
    
                    this.border!.Effect.Properties.EdgeMode.SetValue((BorderEdgeMode2)edgeMode);
    
                    int radiusI = Math.Clamp((int)Math.Ceiling(radius), 1, MaxRadius);
                    RectInt32 samplingRect = RectInt32.FromEdges(-radiusI, -radiusI, radiusI, radiusI);
    
                    this.minValueH!.Effect.Properties.Constants.SetValue(new MinHorizontalShader(radiusI));
                    this.minValueV!.Effect.Properties.Constants.SetValue(new MinVerticalShader(radiusI));
                    this.maxValueH!.Effect.Properties.Constants.SetValue(new MaxHorizontalShader(radiusI));
                    this.maxValueV!.Effect.Properties.Constants.SetValue(new MaxVerticalShader(radiusI));
    
                    using PooledNativeList<Vector4Float> samplingOffsetsRle = PooledNativeList<Vector4Float>.Get();
                    int samplingArea = 0;
                    int cutoffPow2 = ((radiusI * 2 + 1) * (radiusI * 2 + 1) + 2) / 4; // Produces a nicer looking circle than just r^2. Approximately (r+0.5)^2
                    for (int dy = -radiusI; dy <= +radiusI; ++dy)
                    {
                        int dxBegin = int.MaxValue;
                        int dxLength = 0;
                        for (int dx = -radiusI; dx <= +radiusI; ++dx)
                        {
                            if ((dx * dx + dy * dy) <= cutoffPow2)
                            {
                                dxBegin = Math.Min(dxBegin, dx);
                                ++dxLength;
                                ++samplingArea;
                            }
                        }
    
                        Debug.Assert(dxBegin != int.MaxValue && dxLength > 0);
                        samplingOffsetsRle.Add(new Vector4Float(
                            Unsafe.BitCast<int, float>(dxBegin),
                            Unsafe.BitCast<int, float>(dy),
                            Unsafe.BitCast<int, float>(dxLength),
                            0));
                    }
    
                    using ExtentPtrHandle<Vector4Float> samplingOffsetsRleHandle = samplingOffsetsRle.AcquireExtent();
    
                    using PixelShaderResourceTexture1D samplingOffsetsRleResTex = new PixelShaderResourceTexture1D(
                        samplingOffsetsRleHandle.Extent,
                        TextureFilter.MinMagMipPoint,
                        ExtendMode.Clamp);
    
                    if (percentile <= 0.0f || percentile >= 1.0f)
                    {
                        // When percentile is 0, we use this in order to calculate the correct value.
                        // Otherwise we get the min value from the square neighborhood courtesy of MinShader
                        // instead of the circle kernel established by the sampling offsets array.
                        // When percentile is 1, we use this shader because it's a lot faster (>4x).
                        this.hiLoP0orP1Shader!.Effect.Properties.ResourceTexture(1).SetValue(samplingOffsetsRleResTex);
                        this.hiLoP0orP1Shader!.Effect.Properties.Constants.SetValue(new HiLoP0orP1Shader(
                            percentile <= 0.0f,
                            Unsafe.BitCast<RectInt32, int4>(samplingRect)));
    
                        this.TransformGraph.ConnectNode(this.hiLoP0orP1Shader!, this.convertOutputAlpha!, 0);
                    }
                    else
                    {
                        this.hiLoShader!.Effect.Properties.ResourceTexture(3).SetValue(samplingOffsetsRleResTex);
                        this.hiLoShader!.Effect.Properties.Constants.SetValue(new HiLoShader(
                            (float)((double)samplingArea * percentile),
                            (uint)iterations,
                            Unsafe.BitCast<RectInt32, int4>(samplingRect)));
    
                        this.TransformGraph.ConnectNode(this.hiLoShader!, this.convertOutputAlpha!, 0);
                    }
    
                    this.convertOutputAlpha!.Effect.Properties.Mode.SetValue(
                        alphaMode == AlphaMode.Straight ? ConvertAlphaMode.Passthrough : ConvertAlphaMode.Premultiply);
                    
                    this.TransformGraph.SetOutputNode(this.convertOutputAlpha!);
                }
    
                base.OnPrepareForRender(changeType);
            }
    
            // Input0 = source image
            // Input1 = minimum value for kernel area
            // Input2 = maximum value for kernel area
            [D2DInputCount(3)]
            [D2DInputComplex(0)]
            [D2DInputDescription(0, D2D1Filter.MinMagMipPoint)]
            [D2DInputSimple(1)]
            [D2DInputDescription(1, D2D1Filter.MinMagMipPoint)]
            [D2DInputSimple(2)]
            [D2DInputDescription(2, D2D1Filter.MinMagMipPoint)]
            [D2DGeneratedPixelShaderDescriptor]
            [AutoConstructor]
            internal readonly partial struct HiLoShader
                : IPixelShader<HiLoShader>
            {
                static IPixelShaderTransformImpl IPixelShader<HiLoShader>.CreateTransform(in HiLoShader shader)
                {
                    if (shader.iterations > MaxIterations)
                    {
                        throw new InternalErrorException($"iterations ({shader.iterations}) > {nameof(MaxIterations)} ({MaxIterations})");
                    }
    
                    return new HiLoShaderTransform(Unsafe.BitCast<int4, RectInt32>(shader.samplingRectXYWH));
                }
    
                private readonly float targetArea;
                private readonly uint iterations; // [1,MaxIterations]
                private readonly int4 samplingRectXYWH;
    
                // These are actually [dx, dy, len, 0] tuples of type int4
                // This ends up being a few percent faster than doing the dx,dy loop in the
                // shader and skipping pixels that are outside the radius cutoff.
                [AutoConstructorIgnore]
                [D2DResourceTextureIndex(3)]
                private readonly D2D1ResourceTexture1D<float4> samplingOffsetsRle;
    
                public float4 Execute()
                {
                    HiLoState state;
                    state.lo = D2D.GetInput(1);
                    state.hi = D2D.GetInput(2);
                    state.pivot = (state.lo + state.hi) / 2;
    
                    HiLo(ref state);
    #if SEARCH_ARITY_2 || SEARCH_ARITY_4 || SEARCH_ARITY_8 || SEARCH_ARITY_16
                    if (this.iterations >= 2) HiLo(ref state);
                    if (this.iterations >= 3) HiLo(ref state);
    #if SEARCH_ARITY_2 || SEARCH_ARITY_4 || SEARCH_ARITY_8
                    if (this.iterations >= 4) HiLo(ref state);
    #if SEARCH_ARITY_2 || SEARCH_ARITY_4
                    if (this.iterations >= 5) HiLo(ref state);
                    if (this.iterations >= 6) HiLo(ref state);
    #if SEARCH_ARITY_2
                    if (this.iterations >= 7) HiLo(ref state);
                    if (this.iterations >= 8) HiLo(ref state);
                    if (this.iterations >= 9) HiLo(ref state);
                    if (this.iterations >= 10) HiLo(ref state);
                    if (this.iterations >= 11) HiLo(ref state);
                    if (this.iterations >= 12) HiLo(ref state);
    #endif
    #endif
    #endif
    #endif
    
                    return state.pivot;
                }
    
                private struct HiLoState
                {
                    public float4 lo;
                    public float4 pivot;
                    public float4 hi;
                }
    
    #if SEARCH_ARITY_2
                // Binary (2-ary) implementation
                private void HiLo(ref HiLoState state)
                {
                    float4 m0 = state.lo;
                    float4 m1 = state.pivot;
                    float4 m2 = state.hi;
    
                    float4 stepsM0 = 0;
                    float4 stepsM1 = 0;
    
                    int sorLength = this.samplingOffsetsRle.Width;
                    for (int sori = 0; sori < sorLength; ++sori)
                    {
                        int3 dxdyLen = Hlsl.AsInt(this.samplingOffsetsRle[sori].XYZ);
                        float dy = dxdyLen.Y;
                        int dxEnd = dxdyLen.X + dxdyLen.Z;
    
                        for (int dx = dxdyLen.X; dx < dxEnd; ++dx)
                        {
                            float2 offset = new float2(dx, dy);
                            float4 sample = D2D.SampleInputAtOffset(0, offset);
    
                            stepsM0 += Hlsl.Step(sample, m0);
                            stepsM1 += Hlsl.Step(sample, m1);
                        }
                    }
    
                    bool4 isM01 = this.targetArea <= stepsM1;
    
    
                    state.lo = Hlsl.Select(isM01, m0, m1);
                    state.hi = Hlsl.Select(isM01, m1, m2);
                    state.pivot = (state.lo + state.hi) / 2;
                }
    #elif SEARCH_ARITY_4
                // Quaternary (4-ary) implementation
                private void HiLo(ref HiLoState state)
                {
                    float4 m0 = state.lo;
                    float4 m1 = (state.lo + state.pivot) / 2;
                    float4 m2 = state.pivot;
                    float4 m3 = (state.pivot + state.hi) / 2;
                    float4 m4 = state.hi;
    
                    float4 stepsM0 = 0;
                    float4 stepsM1 = 0;
                    float4 stepsM2 = 0;
                    float4 stepsM3 = 0;
    
                    int sorLength = this.samplingOffsetsRle.Width;
                    for (int sori = 0; sori < sorLength; ++sori)
                    {
                        int3 dxdyLen = Hlsl.AsInt(this.samplingOffsetsRle[sori].XYZ);
                        float dy = dxdyLen.Y;
                        int dxEnd = dxdyLen.X + dxdyLen.Z;
    
                        for (int dx = dxdyLen.X; dx < dxEnd; ++dx)
                        {
                            float2 offset = new float2(dx, dy);
                            float4 sample = D2D.SampleInputAtOffset(0, offset);
    
                            stepsM0 += Hlsl.Step(sample, m0);
                            stepsM1 += Hlsl.Step(sample, m1);
                            stepsM2 += Hlsl.Step(sample, m2);
                            stepsM3 += Hlsl.Step(sample, m3);
                        }
                    }
    
                    bool4 isM01 = this.targetArea <= stepsM1;
                    bool4 isM12 = this.targetArea <= stepsM2;
                    bool4 isM23 = this.targetArea <= stepsM3;
    
                    state.lo =
                        Hlsl.Select(isM01, m0,
                        Hlsl.Select(isM12, m1,
                        Hlsl.Select(isM23, m2,
                                           m3)));
    
                    state.hi =
                        Hlsl.Select(isM01, m1,
                        Hlsl.Select(isM12, m2,
                        Hlsl.Select(isM23, m3,
                                           m4)));
    
                    state.pivot = (state.lo + state.hi) / 2;
                }
    #elif SEARCH_ARITY_8
                // 8-ary
                private void HiLo(ref HiLoState state)
                {
                    float4 m0 = state.lo;
                    float4 m1 = (state.lo * 3 + state.pivot) / 4;
                    float4 m2 = (state.lo + state.pivot) / 2;
                    float4 m3 = (state.lo + state.pivot * 3) / 4;
                    float4 m4 = state.pivot;
                    float4 m5 = (state.pivot * 3 + state.hi) / 4;
                    float4 m6 = (state.pivot + state.hi) / 2;
                    float4 m7 = (state.pivot + state.hi * 3) / 4;
                    float4 m8 = state.hi;
    
                    float4 stepsM0 = 0;
                    float4 stepsM1 = 0;
                    float4 stepsM2 = 0;
                    float4 stepsM3 = 0;
                    float4 stepsM4 = 0;
                    float4 stepsM5 = 0;
                    float4 stepsM6 = 0;
                    float4 stepsM7 = 0;
    
                    int sorLength = this.samplingOffsetsRle.Width;
                    for (int sori = 0; sori < sorLength; ++sori)
                    {
                        int3 dxdyLen = Hlsl.AsInt(this.samplingOffsetsRle[sori].XYZ);
                        float dy = dxdyLen.Y;
                        int dxEnd = dxdyLen.X + dxdyLen.Z;
    
                        for (int dx = dxdyLen.X; dx < dxEnd; ++dx)
                        {
                            float2 offset = new float2(dx, dy);
                            float4 sample = D2D.SampleInputAtOffset(0, offset);
    
                            stepsM0 += Hlsl.Step(sample, m0);
                            stepsM1 += Hlsl.Step(sample, m1);
                            stepsM2 += Hlsl.Step(sample, m2);
                            stepsM3 += Hlsl.Step(sample, m3);
                            stepsM4 += Hlsl.Step(sample, m4);
                            stepsM5 += Hlsl.Step(sample, m5);
                            stepsM6 += Hlsl.Step(sample, m6);
                            stepsM7 += Hlsl.Step(sample, m7);
                        }
                    }
    
                    bool4 isM01 = this.targetArea <= stepsM1;
                    bool4 isM12 = this.targetArea <= stepsM2;
                    bool4 isM23 = this.targetArea <= stepsM3;
                    bool4 isM34 = this.targetArea <= stepsM4;
                    bool4 isM45 = this.targetArea <= stepsM5;
                    bool4 isM56 = this.targetArea <= stepsM6;
                    bool4 isM67 = this.targetArea <= stepsM7;
    
                    state.lo =
                        Hlsl.Select(isM01, m0,
                        Hlsl.Select(isM12, m1,
                        Hlsl.Select(isM23, m2,
                        Hlsl.Select(isM34, m3,
                        Hlsl.Select(isM45, m4,
                        Hlsl.Select(isM56, m5,
                        Hlsl.Select(isM67, m6,
                                           m7)))))));
    
                    state.hi =
                        Hlsl.Select(isM01, m1,
                        Hlsl.Select(isM12, m2,
                        Hlsl.Select(isM23, m3,
                        Hlsl.Select(isM34, m4,
                        Hlsl.Select(isM45, m5,
                        Hlsl.Select(isM56, m6,
                        Hlsl.Select(isM67, m7,
                                           m8)))))));
    
                    state.pivot = (state.lo + state.hi) / 2;
                }
    #elif SEARCH_ARITY_16
                // 16-ary
                private void HiLo(ref HiLoState state)
                {
                    float4 m0 = state.lo;
                    float4 m1 = (state.lo * 7 + state.pivot * 1) / 8;
                    float4 m2 = (state.lo * 6 + state.pivot * 2) / 8;
                    float4 m3 = (state.lo * 5 + state.pivot * 3) / 8;
                    float4 m4 = (state.lo * 4 + state.pivot * 4) / 8;
                    float4 m5 = (state.lo * 3 + state.pivot * 5) / 8;
                    float4 m6 = (state.lo * 2 + state.pivot * 6) / 8;
                    float4 m7 = (state.lo * 1 + state.pivot * 7) / 8;
                    float4 m8 = state.pivot;
                    float4 m9 = (state.pivot * 7 + state.hi * 1) / 8;
                    float4 mA = (state.pivot * 6 + state.hi * 2) / 8;
                    float4 mB = (state.pivot * 5 + state.hi * 3) / 8;
                    float4 mC = (state.pivot * 4 + state.hi * 4) / 8;
                    float4 mD = (state.pivot * 3 + state.hi * 5) / 8;
                    float4 mE = (state.pivot * 2 + state.hi * 6) / 8;
                    float4 mF = (state.pivot * 1 + state.hi * 7) / 8;
                    float4 mG = state.hi;
    
                    float4 stepsM0 = 0;
                    float4 stepsM1 = 0;
                    float4 stepsM2 = 0;
                    float4 stepsM3 = 0;
                    float4 stepsM4 = 0;
                    float4 stepsM5 = 0;
                    float4 stepsM6 = 0;
                    float4 stepsM7 = 0;
                    float4 stepsM8 = 0;
                    float4 stepsM9 = 0;
                    float4 stepsMA = 0;
                    float4 stepsMB = 0;
                    float4 stepsMC = 0;
                    float4 stepsMD = 0;
                    float4 stepsME = 0;
                    float4 stepsMF = 0;
    
                    int sorLength = this.samplingOffsetsRle.Width;
                    for (int sori = 0; sori < sorLength; ++sori)
                    {
                        int3 dxdyLen = Hlsl.AsInt(this.samplingOffsetsRle[sori].XYZ);
                        float dy = dxdyLen.Y;
                        int dxEnd = dxdyLen.X + dxdyLen.Z;
    
                        for (int dx = dxdyLen.X; dx < dxEnd; ++dx)
                        {
                            float2 offset = new float2(dx, dy);
                            float4 sample = D2D.SampleInputAtOffset(0, offset);
    
                            stepsM0 += Hlsl.Step(sample, m0);
                            stepsM1 += Hlsl.Step(sample, m1);
                            stepsM2 += Hlsl.Step(sample, m2);
                            stepsM3 += Hlsl.Step(sample, m3);
                            stepsM4 += Hlsl.Step(sample, m4);
                            stepsM5 += Hlsl.Step(sample, m5);
                            stepsM6 += Hlsl.Step(sample, m6);
                            stepsM7 += Hlsl.Step(sample, m7);
                            stepsM8 += Hlsl.Step(sample, m8);
                            stepsM9 += Hlsl.Step(sample, m9);
                            stepsMA += Hlsl.Step(sample, mA);
                            stepsMB += Hlsl.Step(sample, mB);
                            stepsMC += Hlsl.Step(sample, mC);
                            stepsMD += Hlsl.Step(sample, mD);
                            stepsME += Hlsl.Step(sample, mE);
                            stepsMF += Hlsl.Step(sample, mF);
                        }
                    }
    
                    bool4 isM01 = this.targetArea <= stepsM1;
                    bool4 isM12 = this.targetArea <= stepsM2;
                    bool4 isM23 = this.targetArea <= stepsM3;
                    bool4 isM34 = this.targetArea <= stepsM4;
                    bool4 isM45 = this.targetArea <= stepsM5;
                    bool4 isM56 = this.targetArea <= stepsM6;
                    bool4 isM67 = this.targetArea <= stepsM7;
                    bool4 isM78 = this.targetArea <= stepsM8;
                    bool4 isM89 = this.targetArea <= stepsM9;
                    bool4 isM9A = this.targetArea <= stepsMA;
                    bool4 isMAB = this.targetArea <= stepsMB;
                    bool4 isMBC = this.targetArea <= stepsMC;
                    bool4 isMCD = this.targetArea <= stepsMD;
                    bool4 isMDE = this.targetArea <= stepsME;
                    bool4 isMEF = this.targetArea <= stepsMF;
    
                    state.lo =
                        Hlsl.Select(isM01, m0,
                        Hlsl.Select(isM12, m1,
                        Hlsl.Select(isM23, m2,
                        Hlsl.Select(isM34, m3,
                        Hlsl.Select(isM45, m4,
                        Hlsl.Select(isM56, m5,
                        Hlsl.Select(isM67, m6,
                        Hlsl.Select(isM78, m7,
                        Hlsl.Select(isM89, m8,
                        Hlsl.Select(isM9A, m9,
                        Hlsl.Select(isMAB, mA,
                        Hlsl.Select(isMBC, mB,
                        Hlsl.Select(isMCD, mC,
                        Hlsl.Select(isMDE, mD,
                        Hlsl.Select(isMEF, mE,
                                           mF)))))))))))))));
    
                    state.hi =
                        Hlsl.Select(isM01, m1,
                        Hlsl.Select(isM12, m2,
                        Hlsl.Select(isM23, m3,
                        Hlsl.Select(isM34, m4,
                        Hlsl.Select(isM45, m5,
                        Hlsl.Select(isM56, m6,
                        Hlsl.Select(isM67, m7,
                        Hlsl.Select(isM78, m8,
                        Hlsl.Select(isM89, m9,
                        Hlsl.Select(isM9A, mA,
                        Hlsl.Select(isMAB, mB,
                        Hlsl.Select(isMBC, mC,
                        Hlsl.Select(isMCD, mD,
                        Hlsl.Select(isMDE, mE,
                        Hlsl.Select(isMEF, mF,
                                           mG)))))))))))))));
    
                    state.pivot = (state.lo + state.hi) / 2;
                }
    #else
        #error Must #define SEARCH_ARITY_2, _4, _8, or _16
    #endif
            }
    
            // Implementation of HiLoShader for when p=0 or p=1
            // It just returns the min or max value of the pixels within the kernel
            [D2DInputCount(1)]
            [D2DInputComplex(0)]
            [D2DInputDescription(0, D2D1Filter.MinMagMipPoint)]
            [D2DGeneratedPixelShaderDescriptor]
            [AutoConstructor]
            internal readonly partial struct HiLoP0orP1Shader
                : IPixelShader<HiLoP0orP1Shader>
            {
                static IPixelShaderTransformImpl IPixelShader<HiLoP0orP1Shader>.CreateTransform(in HiLoP0orP1Shader shader)
                {
                    return new HiLoShaderTransform(Unsafe.BitCast<int4, RectInt32>(shader.samplingRectXYWH));
                }
    
                private readonly bool selectMinOrMax;
                private readonly int4 samplingRectXYWH;
    
                [AutoConstructorIgnore]
                [D2DResourceTextureIndex(1)]
                private readonly D2D1ResourceTexture1D<float4> samplingOffsetsRle;
    
                public float4 Execute()
                {
                    float4 min = (float4)float.PositiveInfinity;
                    float4 max = (float4)float.NegativeInfinity;
    
                    int sorLength = this.samplingOffsetsRle.Width;
                    for (int sori = 0; sori < sorLength; ++sori)
                    {
                        int3 dxdyLen = Hlsl.AsInt(this.samplingOffsetsRle[sori].XYZ);
                        float dy = dxdyLen.Y;
                        int dxEnd = dxdyLen.X + dxdyLen.Z;
    
                        for (int dx = dxdyLen.X; dx < dxEnd; ++dx)
                        {
                            float2 offset = new float2(dx, dy);
                            float4 sample = D2D.SampleInputAtOffset(0, offset);
                            min = Hlsl.Min(min, sample);
                            max = Hlsl.Max(max, sample);
                        }
                    }
    
                    return this.selectMinOrMax ? min : max;
                }
            }
    
            private sealed class HiLoShaderTransform
                : RefTrackedObject,
                  IPixelShaderTransformImpl
            {
                private readonly RectInt32 samplingRect;
    
                public HiLoShaderTransform(RectInt32 samplingRect)
                {
                    this.samplingRect = samplingRect;
                }
    
                public void MapInputRectsToOutputRect(
                    ReadOnlySpan<RectInt32> inputRects,
                    ReadOnlySpan<RectInt32> inputOpaqueSubRects,
                    out RectInt32 outputRect,
                    out RectInt32 outputOpaqueSubRect)
                {
                    MapInvalidRect(0, inputRects[0], out outputRect);
                    outputOpaqueSubRect = default;
                }
    
                public void MapOutputRectToInputRects(RectInt32 outputRect, Span<RectInt32> inputRects)
                {
                    for (int i = 0; i < inputRects.Length; ++i)
                    {
                        MapInvalidRect(i, outputRect, out inputRects[i]);
                    }
                }
    
                public void MapInvalidRect(int inputIndex, RectInt32 invalidInputRect, out RectInt32 invalidOutputRect)
                {
                    switch (inputIndex)
                    {
                        case 0:
                            RectInt64 rect0 = new RectInt64(
                                (long)invalidInputRect.X + this.samplingRect.Left,
                                (long)invalidInputRect.Y + this.samplingRect.Top,
                                (long)invalidInputRect.Width + this.samplingRect.Width,
                                (long)invalidInputRect.Height + this.samplingRect.Height);
                            RectInt64 rect1 = RectInt64.Intersect(rect0, RectInt32.LogicallyInfinite);
                            invalidOutputRect = (RectInt32)rect1;
                            break;
    
                        case 1:
                        case 2:
                            invalidOutputRect = invalidInputRect;
                            break;
    
                        default:
                            throw new IndexOutOfRangeException();
                    }
                }
            }
    
            [D2DInputCount(1)]
            [D2DInputComplex(0)]
            [D2DInputDescription(0, D2D1Filter.MinMagMipPoint)]
            [D2DGeneratedPixelShaderDescriptor]
            [AutoConstructor]
            internal readonly partial struct MinHorizontalShader
                : IPixelShader<MinHorizontalShader>
            {
                static IPixelShaderTransformImpl IPixelShader<MinHorizontalShader>.CreateTransform(in MinHorizontalShader shader)
                {
                    return new RadiusRectTransform(shader.radius, 0);
                }
    
                private readonly int radius;
    
                public float4 Execute()
                {
                    float4 min = (float4)float.PositiveInfinity;
    
                    for (int dx = -this.radius; dx <= this.radius; ++dx)
                    {
                        min = Hlsl.Min(min, D2D.SampleInputAtOffset(0, new float2(dx, 0)));
                    }
    
                    return min;
                }
            }
    
            [D2DInputCount(1)]
            [D2DInputComplex(0)]
            [D2DInputDescription(0, D2D1Filter.MinMagMipPoint)]
            [D2DGeneratedPixelShaderDescriptor]
            [AutoConstructor]
            internal readonly partial struct MinVerticalShader
                : IPixelShader<MinVerticalShader>
            {
                static IPixelShaderTransformImpl IPixelShader<MinVerticalShader>.CreateTransform(in MinVerticalShader shader)
                {
                    return new RadiusRectTransform(0, shader.radius);
                }
    
                private readonly int radius;
    
                public float4 Execute()
                {
                    float4 min = (float4)float.PositiveInfinity;
    
                    for (int dy = -this.radius; dy <= this.radius; ++dy)
                    {
                        min = Hlsl.Min(min, D2D.SampleInputAtOffset(0, new float2(0, dy)));
                    }
    
                    return min;
                }
            }
    
            [D2DInputCount(1)]
            [D2DInputComplex(0)]
            [D2DInputDescription(0, D2D1Filter.MinMagMipPoint)]
            [D2DGeneratedPixelShaderDescriptor]
            [AutoConstructor]
            internal readonly partial struct MaxHorizontalShader
                : IPixelShader<MaxHorizontalShader>
            {
                static IPixelShaderTransformImpl IPixelShader<MaxHorizontalShader>.CreateTransform(in MaxHorizontalShader shader)
                {
                    return new RadiusRectTransform(shader.radius, 0);
                }
    
                private readonly int radius;
    
                public float4 Execute()
                {
                    float4 max = (float4)float.NegativeInfinity;
    
                    for (int dx = -this.radius; dx <= this.radius; ++dx)
                    {
                        max = Hlsl.Max(max, D2D.SampleInputAtOffset(0, new float2(dx, 0)));
                    }
    
                    return max;
                }
            }
    
            [D2DInputCount(1)]
            [D2DInputComplex(0)]
            [D2DInputDescription(0, D2D1Filter.MinMagMipPoint)]
            [D2DGeneratedPixelShaderDescriptor]
            [AutoConstructor]
            internal readonly partial struct MaxVerticalShader
                : IPixelShader<MaxVerticalShader>
            {
                static IPixelShaderTransformImpl IPixelShader<MaxVerticalShader>.CreateTransform(in MaxVerticalShader shader)
                {
                    return new RadiusRectTransform(0, shader.radius);
                }
    
                private readonly int radius;
    
                public float4 Execute()
                {
                    float4 max = (float4)float.NegativeInfinity;
    
                    for (int dy = -this.radius; dy <= this.radius; ++dy)
                    {
                        max = Hlsl.Max(max, D2D.SampleInputAtOffset(0, new float2(0, dy)));
                    }
    
                    return max;
                }
            }
    
            private sealed class RadiusRectTransform
                : RefTrackedObject,
                  IPixelShaderTransformImpl
            {
                private readonly int radiusX;
                private readonly int radiusY;
    
                public RadiusRectTransform(int radiusX, int radiusY)
                {
                    this.radiusX = radiusX;
                    this.radiusY = radiusY;
                }
    
                public void MapInputRectsToOutputRect(
                    ReadOnlySpan<RectInt32> inputRects,
                    ReadOnlySpan<RectInt32> inputOpaqueSubRects,
                    out RectInt32 outputRect,
                    out RectInt32 outputOpaqueSubRect)
                {
                    MapInvalidRect(0, inputRects[0], out outputRect);
                    outputOpaqueSubRect = default;
                }
    
                public void MapInvalidRect(int inputIndex, RectInt32 invalidInputRect, out RectInt32 invalidOutputRect)
                {
                    invalidOutputRect = InflateHelper(invalidInputRect, this.radiusX, this.radiusY);
                }
    
                public void MapOutputRectToInputRects(RectInt32 outputRect, Span<RectInt32> inputRects)
                {
                    inputRects[0] = InflateHelper(outputRect, this.radiusX, this.radiusY);
                }
    
                private static RectInt32 InflateHelper(RectInt32 rect, int radiusX, int radiusY)
                {
                    // First, do calculations at 64-bit to avoid overflow
                    long left = (long)rect.Left - radiusX;
                    long top = (long)rect.Top - radiusY;
                    long right = (long)rect.Right + radiusX;
                    long bottom = (long)rect.Bottom + radiusY;
    
                    // Create a 64-bit rectangle
                    RectInt64 result64 = RectInt64.FromEdges(left, top, right, bottom);
    
                    // Clamp (intersect) the rectangle to the 32-bit "logically infinite" area, then cast to a 32-bit rectangle
                    RectInt32 result = (RectInt32)RectInt64.Intersect(result64, RectInt32.LogicallyInfinite);
    
                    return result;
                }
            }
        }
    }
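
    InflateHelper's comment about doing the math at 64-bit is worth illustrating on its own: in C#, int - int is evaluated in 32 bits unless one operand is cast to long first. Below is a minimal standalone sketch of the inflate-then-clamp idea. The type and method names are hypothetical, and the clamp bounds are simply the Int32 range, which may differ from what RectInt32.LogicallyInfinite actually uses.

    ```csharp
    using System;

    public static class RectInflateSketch
    {
        // Clamps a 64-bit value into the Int32 range (assumed bounds, for illustration only).
        private static int ClampToInt32(long v) =>
            v < int.MinValue ? int.MinValue : v > int.MaxValue ? int.MaxValue : (int)v;

        // Inflates the edges by (radiusX, radiusY) using 64-bit math, then clamps.
        public static (int Left, int Top, int Right, int Bottom) Inflate(
            int left, int top, int right, int bottom, int radiusX, int radiusY)
        {
            // Widen one operand BEFORE the arithmetic so the add/subtract happens in 64 bits.
            long l = (long)left - radiusX;
            long t = (long)top - radiusY;
            long r = (long)right + radiusX;
            long b = (long)bottom + radiusY;

            return (ClampToInt32(l), ClampToInt32(t), ClampToInt32(r), ClampToInt32(b));
        }
    }
    ```

    With a right edge near int.MaxValue, the sketch saturates at int.MaxValue instead of wrapping around to a negative coordinate.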

     

     

  6. Just now, _koh_ said:

    I want to match intermediate buffer precision to the original buffer precision

     

    This information isn't available in the plugin interfaces. I would simply add an option in the UI for low vs. full precision.

     

    2 minutes ago, _koh_ said:

    When you said you are calling HiLo() 12 times to avoid banding

     

    From what I could understand from the algorithm, each call to HiLo() essentially calculates 1 bit of precision starting from the most-significant bit. When working with linearized pixels (that is, WorkingSpaceLinear instead of WorkingSpace), you need up to 12 bits because the values are spread out differently.
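
    To make the "1 bit per call" idea concrete, here is a minimal CPU-side sketch of a percentile search by bisection on a single channel. It is not the shader code: the class and method names are hypothetical, and it assumes samples are normalized to [0, 1]. Each iteration counts how many samples are at or below the pivot and halves the interval, so n iterations narrow the answer to a range of width 2^-n.

    ```csharp
    using System;

    public static class PercentileBisection
    {
        // Approximates the smallest v in [0, 1] such that at least targetCount
        // samples are <= v. Every iteration resolves one more bit of the result.
        public static float Estimate(float[] samples, int targetCount, int iterations)
        {
            float lo = 0f;
            float hi = 1f;

            for (int i = 0; i < iterations; ++i)
            {
                float pivot = (lo + hi) / 2;

                // Equivalent to summing step(sample, pivot) over the kernel.
                int count = 0;
                foreach (float s in samples)
                {
                    if (s <= pivot)
                    {
                        ++count;
                    }
                }

                // Enough samples at or below the pivot: the answer is in the lower half.
                if (targetCount <= count) { hi = pivot; } else { lo = pivot; }
            }

            return (lo + hi) / 2;
        }
    }
    ```

    For example, with samples { 0.1, 0.2, 0.3, 0.9, 0.95 } and targetCount = 3 (the median), 8 iterations land within 1/256 of 0.3 and 12 iterations within 1/4096.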

     

    The increase from 8 to 12 is pretty dramatic with some images, but I could only see very minute differences after that. Even 11 to 12 was very small, but still noticeable upon close inspection.
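
    A rough back-of-the-envelope check on why about 12 bits shows up: the sketch below assumes the standard sRGB transfer function (other color profiles will differ) and asks how fine a uniform grid on [0, 1] must be to separate the two darkest 8-bit sRGB codes after linearization. The names are hypothetical, and this is only one plausible explanation for the number, not a derivation of it.

    ```csharp
    using System;

    public static class LinearPrecisionSketch
    {
        // Standard sRGB decoding: encoded [0, 1] -> linear [0, 1].
        public static double SrgbToLinear(double c) =>
            c <= 0.04045 ? c / 12.92 : Math.Pow((c + 0.055) / 1.055, 2.4);

        // Number of bits n such that a grid with spacing 2^-n is no coarser than step.
        public static int BitsToResolve(double step) =>
            (int)Math.Ceiling(Math.Log2(1.0 / step));

        // Bits needed to tell apart 8-bit sRGB codes 0 and 1 once they are linearized.
        public static int BitsForDarkestStep()
        {
            double step = SrgbToLinear(1.0 / 255.0) - SrgbToLinear(0.0);
            return BitsToResolve(step);
        }
    }
    ```

    In the encoded space a step of 1/255 needs 8 bits, while the same step near black shrinks to roughly 1/3295 after linearization, which needs 12.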

     

    Going forward it may be necessary to run up to 16 times, but it's easy to make it configurable for when that comes up.

  7. 40 minutes ago, _koh_ said:

    This particular shader only takes straight alpha sRGB and outputs straight alpha sRGB

     

    btw this won't necessarily be true in future releases of Paint.NET

     

    First, the upcoming v5.1 will have color management -- so an effect will receive pixels either in the image's "working space" (currently de facto sRGB), where they are still the unmodified BGRA32 values, or converted to the linearized version of the image's actual color profile (WorkingSpaceLinear is the default). As a backup, in case the color profile can't be linearized, the image will be converted to scRGB (linear sRGB). The effect's output will then be automatically converted back to the storage format of the image.

     

    Second, for future releases I am planning on adding higher-precision pixel formats like RGBA64 (4 x uint16), RGBA64Half (4 x float16), and even RGBA128Float (4 x float32). 

     

    In other words, I would not rely on PrecisionEffect(UInt8Normalized) as a way to maintain the original precision -- because that won't be true in the future. I designed the new effect systems with future-proofing in mind!

  8. 1 hour ago, _koh_ said:

    Now I'm using PrecisionEffect instead of CDC.Bitmap which gives me the same result at the same performance but I don't know what it's actually doing. Does it create intermediate buffer?

     

    Another thing to note is that Paint.NET always runs effects at the highest precision (32-bit float per component / 128-bits per pixel). The SourceImage is still stored on the GPU as 32-bit BGRA, but is then premultiplied and/or color converted using 128-bpp to ensure the best quality. By using PrecisionEffect you are manually reducing the precision, which as you've seen can improve performance. However, it will of course reduce precision and color accuracy.

     

    IMO it's not worth it, unless you're using caching (set effect.Properties.Cached to true) and you set the precision to Float16. This (caching) is almost never necessary, however, and should only be used very carefully and sparingly.

  9. On 2/4/2024 at 11:05 PM, _koh_ said:

    Looks like you get some gain if you do it properly.

     

    This compute shader's performance advantage seems to be that it greatly reduces the number of texture sampling instructions. It does not reduce the computational requirements -- each output pixel still needs to do the same amount of work. But there's up to an 87.5% reduction in texture sampling instructions because a sample that is used to compute multiple output pixels is only retrieved once. It likely doesn't reduce VRAM bandwidth because the GPU would be using an internal cache (e.g. L2) anyway, but it will reduce the bandwidth pressure on that internal cache.
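
    The 87.5% figure equals 1 - 1/8. One way such a bound could arise (an assumption on my part, since the thread layout isn't spelled out here) is a thread that produces 8 horizontally adjacent outputs and fetches each shared sample once. The hypothetical sketch below counts fetches for one kernel row under that assumption.

    ```csharp
    public static class SampleSharingSketch
    {
        // Fraction of texture fetches saved on one kernel row when a thread computes
        // pixelsPerThread horizontally adjacent outputs and loads each sample once.
        public static double Reduction(int radius, int pixelsPerThread)
        {
            long width = 2L * radius + 1;               // samples per output on this row
            long separate = pixelsPerThread * width;    // every output fetches its own row
            long shared = width + pixelsPerThread - 1;  // union of the overlapping rows
            return 1.0 - (double)shared / separate;
        }
    }
    ```

    With radius 100 and 8 outputs per thread this gives about 87.1%, approaching but never reaching 87.5% as the radius grows; with 2 outputs per thread the limit is 50%.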

  10. PrecisionEffect is a pass-through effect that uses a pixel shader to read the input image. This ensures Direct2D can't optimize it away. So yes, it is essentially forcing an intermediate buffer so that the next effect in the chain will consume the source at the given precision.

     

    Source -> Precision -> NextEffect

     

    This contrasts with PassthroughEffect which is a proper "passthrough" effect -- it uses ID2D1TransformGraph::SetPassthroughGraph() so it essentially "washes away" at render time as if it didn't even exist in the first place. It's not really useful for an effect graph, but it does have uses in some niche cases for architectural purposes. DynamicImage (e.g. PdnDentsEffect) uses this so that it can hand you the PassthroughEffect which you can plug into an effect graph, but then it can change which image/effect is plugged into that PassthroughEffect. This means you don't have to keep retrieving the DynamicImage's "output" when you change its properties (DynamicImage is not actually an ID2D1Image/ID2D1Effect).

     

    It's very beneficial to use PrecisionEffect instead of a CompatibleDeviceContext.Bitmap because 1) that lets Direct2D manage the rendering process and memory management, and 2) it permits Paint.NET to manage rendering with tiles along with progress reporting and cancellation support. Otherwise you're forcing everything to render during OnCreateOutput(), during which there is no progress reporting or cancellation support.
