phrebh

Persistence needed during render loops


Well, I updated my color averaging plugin to fix the multithreading problems, but now I see there are some additional comments about synchronization that make me a little cautious about posting the new DLL. Here's my current approach:

private volatile bool firstTime = true;
private object myLock = new object();
private ColorBgra averageColor = new ColorBgra();
...
if (firstTime)
{
    lock (myLock)
    {
        if (firstTime)
        {
            // Compute average color
            ...
            firstTime = false;
        }
    }
}

I'd already added the volatile keyword before reading the new comments; not because I thought about the reordering problem Neil mentioned, but because I was worried that a clever optimizer might delete the second test of firstTime as redundant.

I doubt making firstTime volatile would have a significant effect on efficiency (though I don't know much about the issue), but if it would, I wonder if the following would avoid the problem:

private bool firstTime = true;
private volatile bool colorComputed = false;
private object myLock = new object();
private ColorBgra averageColor = new ColorBgra();
...
if (firstTime)
{
    lock (myLock)
    {
        if (!colorComputed)
        {
            // Compute average color
            ...
            firstTime = false;
            colorComputed = true;
        }
    }
}

The idea is to only test the volatile variable inside a seldom-executed section. Since the purpose of the outer if test is simply to avoid the overhead of going through the lock, it wouldn't matter if, once in a while, it failed due to multithread timing.


If you're inside a critical section, then you don't need to use volatile. You only really need volatile if you're writing lock-free code and you know exactly what you're doing (which excludes almost every situation, and everyone, including me).


I did a few timing tests on fairly large images, comparing my original thread-unsafe version to the version where firstTime is declared volatile and tested outside and inside the locked region. The new version doesn't seem to be much slower, so I'll probably just go with that version.


Does your code really really really really need this much optimization?

I'd shoot for correctness first. Paint.NET is already multithreading the thing; your biggest performance gains are going to come from dual core (+100% performance) and from improvements to .NET's code generation (not in your hands). The stuff you're looking at will probably only net you a few percent. I'm just not convinced it's worth it.


I think I chose the simplest option that ensures thread safety and avoids going through the lock on every invocation. At least I meant to. Dual-core improvements really don't help me much, unless I want to buy a new computer.


Note that your intuitive average function is wrong, independently of the color space in which you want to compute it (RGB or HSV).

Suppose the color space were just a single grey channel, and you are computing the average of 4 pixels. What you are actually computing is:

average(average(average(g[0,0], g[0,1]), g[1,0]), g[1,1])

What you get from that formula is not the linear average of the greys but a weighted average, in which the last pixel carries half the total weight and the three previous pixels together share the other half.

In other words, the first two pixels have weight 1, the third has weight 2, and the last has weight 4. On a much larger image, the result depends mostly on the color of the bottom-right corner of the image: change *only* that bottom-right pixel to white or to black and you'll see that it has a *very* large effect on your final average. Now change only the top-left pixel between white and black, and it has absolutely no effect on the computed average.

Suppose for example that the 4 grey values are (in your preferred color space) 16, 32, 64, 128, in that order. The running averages you get from them will be successively 16, 24, 44, and finally 86. But if the pixels are in the reverse order, you'll get successively 128, 96, 64, and finally 40. The results are very different because the pixels are not weighted identically.
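The two orderings above are easy to verify. Here is a small illustrative sketch (in Python rather than C#, just to demonstrate the arithmetic) comparing the flawed pairwise running average with the true linear average:

```python
def running_pairwise_average(values):
    # The flawed method: average each new value with the running result,
    # so later values carry exponentially more weight.
    result = values[0]
    for v in values[1:]:
        result = (result + v) / 2
    return result

def true_average(values):
    # The correct method: one sum, one division.
    return sum(values) / len(values)

greys = [16, 32, 64, 128]
print(running_pairwise_average(greys))        # 86.0 (last pixel dominates)
print(running_pairwise_average(greys[::-1]))  # 40.0 (order changes the result)
print(true_average(greys))                    # 60.0 (order-independent)
```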

In fact, with this incorrect method of computing averages, the more pixels there are, the more pronounced the distortion becomes. When the computed average must fit into an 8-bit integer, only roughly the last 8 pixels at the end of the bottom row have any influence, and all the rest of the source image is completely ignored: if the source image is a large, almost completely plain-white image with only the last 8 pixels plain black, the "average" you get will be plain black! That is clearly not an average.

A true average needs to be computed differently: sum the color components separately within their planes, and count the number of pixels whose components you sum. You cannot reuse an intermediate average unless you also keep its relative weight (which grows linearly as you scan the image from top-left to bottom-right, while the weight of each newly added pixel does not grow but remains constant).

So what you really want is:

- convert all pixels from RGB to HSV (or to any color space in which you want to compute the average)

- sum all the pixels you want to include, in separate planes, with one summing variable per color component (sumH, sumS, sumV)

- finally, divide each component by the number of pixels included in the sum; you'll get the HSV color (sumH/count, sumS/count, sumV/count)

- ONLY then, convert that final HSV color to RGB.
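The steps above can be sketched as follows (a minimal Python illustration, not Paint.NET code; the pixels are assumed to be HSV tuples already converted from RGB):

```python
def average_color(pixels):
    # Sum each color component in its own plane, then divide once by the
    # pixel count, so every pixel carries exactly the same weight.
    count = len(pixels)
    sum_h = sum_s = sum_v = 0.0
    for h, s, v in pixels:
        sum_h += h
        sum_s += s
        sum_v += v
    return (sum_h / count, sum_s / count, sum_v / count)

# Four pixels whose V components are the grey values from the earlier example:
print(average_color([(0, 0, 16), (0, 0, 32), (0, 0, 64), (0, 0, 128)]))
# (0.0, 0.0, 60.0)
```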

But looking at the HSV<=>RGB conversion code you use, it appears to be a completely linear transform, which is basically a rotation and skew within a 3D space: computing the average in a linearly transformed intermediate space will be no different from computing it directly in the original RGB space.

The HSV average would only differ if your transform were not fully linear: for example, if you applied a gamma transform, or if the HSV values were clipped to a constrained range using if() tests or min()/max(). Here you don't seem to apply any gamma transform. (In true photographic or TV color spaces, the gamma function is applied only within a subrange of the component values; values near the minimum and/or maximum of the range are transformed linearly instead, to avoid saturation effects.) But even in this case you'll have some saturation, and the components will be clipped into the valid domain: this really does happen in your case, because you're effectively using part of the YCbCr color space transform (typically used for video on DVD and HDTV).

But your main problem is elsewhere: your code is not guaranteed to use a fully precomputed average before rendering a band. You cannot control the order in which the various threads run: one may start running and get delayed, while another starts and finishes filling its ROI using a not-yet-computed average; then the first thread comes back, finishes computing the average, and uses it. In practice, some threads will use the computed "average" and some won't, using only partially computed averages stored in your global class.

Your example is the perfect illustration of why Render() routines MUST NOT depend on results computed by other concurrent threads: each call to Render() MUST be able to fill its ROI completely, computing its results ONLY from the parameters it receives (the full source image and the custom plugin parameters). It MUST draw only within the destination rectangle, and nothing it does may affect concurrently running threads.

Your plugin is also the perfect example of why we need multipass effects, because otherwise each thread has to perform the same computation several times on the same source image: each one MUST compute the full-image average before writing the result into its own rectangle of interest in the destination image.

So the first solution given to your problem is the right one (except that it skips the color space transform and uses simple blending, with unnecessarily unsafe code just to make it faster; but it effectively computes the same final average/blended color, computing it several times, i.e. once over the full image within each thread).

One way to go, then, is for each thread to try to acquire a lock in order to decide which one becomes the main thread in charge of computing the average. The others wait for that exclusive lock; when they get it, they check whether the average has been computed, and if so take the computed average and release the lock (allowing the other threads to do the same and get the same average value). Then all threads draw into their ROI of the destination image using the average they got from the critical section.

What you'll see: nothing from any thread, as long as the first thread that took the lock has not finished computing the average. When that thread finishes, it must store the average and set an "already computed" flag to inform the others that they don't need to recompute it (when they get the lock), and only then release the lock and start drawing, concurrently with the other threads that will use the same computed value.
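The compute-once protocol just described can be sketched like this (illustrative Python with a plain lock standing in for the C# lock statement; the trivial averaging expression is a placeholder for the real full-image computation):

```python
import threading

class SharedAverage:
    # Shared across all worker threads: the first thread to take the lock
    # computes the average; later threads find the flag set and just read it.
    def __init__(self):
        self.lock = threading.Lock()
        self.computed = False
        self.average = None
        self.compute_calls = 0  # instrumentation, to show the work runs once

    def get_average(self, source):
        with self.lock:
            if not self.computed:
                self.compute_calls += 1
                self.average = sum(source) / len(source)  # placeholder "average"
                self.computed = True
        # Every thread then draws its own ROI using the same value.
        return self.average

shared = SharedAverage()
source = list(range(256))
threads = [threading.Thread(target=shared.get_average, args=(source,))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared.compute_calls, shared.average)  # 1 127.5
```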

You should look at the other forum thread that discusses how to create multipass effects, because that is exactly what you need.

So you have at least three things to fix:

- correctly managing the multiple passes, avoiding unnecessary recomputation of the same average

- computing the average correctly (which you currently don't, as you combine pixels with very different weights, producing a weighted average in which the bottom-right pixels count as much as ALL the rest of the image)

- correcting your color space transform.

As a simple first step, don't even try to compute the average only once. Yes, it will be slower (about 50 times slower for the average-color computation, but not that much overall, because the rest of the time is spent filling the destination ROIs, which you won't repeat), but this is the first thing to correct.

Then correct your color space transform (use value-range clipping and gamma correction (i.e. exponential<=>logarithm), including the linear segments near the extreme values): RGB to YCbCr (or YUV), and YCbCr to RGB. The effective transforms are NOT just a simple linear multiplication of a vector by a matrix, because the YCbCr and YUV color models are not the result of a simple linear matrix transform (there are distinct value ranges, and gamma correction changes things radically; but you don't need HSV mapping, which uses polar coordinates and needs slow trigonometric functions that don't always respect human color perception).

Then write the mutex code to prevent multiple concurrent threads from computing the same FULL-image average.

I'll have to think about the best way to solve the problem. Normally I tend to use the most straightforward approach unless there's a compelling reason to do something more complicated. Even though in the current case I'm not convinced that just changing where the firstTime flag is cleared isn't an entirely adequate solution, I'll probably use the more proper locking approach anyway. I assume the code is:

if (firstTime)
{
    lock (myLock)
    {
        if (firstTime)
        {
            // Compute average color
            ...

            firstTime = false;
        }
    }
}

(I intend to eventually follow BoltBait's suggestion and write plugins outside of CodeLab. I can then take advantage of the OnRender function to compute the average.)

That pseudo-code is incorrect: you MUST NOT test the "firstTime" variable before getting the exclusive lock! And if you test it outside the lock (to see whether you need to acquire it), you must test it again inside the critical section.

The correct code is:

if (!myLock.averageComputed)
{
    lock (myLock)
    {
        if (!myLock.averageComputed)
        {
            // Compute average color on full source image
            myLock.averageColor = average(sourceSurface);
            myLock.averageComputed = true;
        }
    }
}
// draw into the destination ROI in destSurface using myLock.averageColor

Then you must make sure there's only one "myLock" object. It must be declared static and initialized in its constructor with the averageComputed field or property set to false.

If you want your effect to be usable several times on distinct regions, you'll need a "reset" or "run now" button in the parameters dialog that, when pressed, simply resets this "myLock.averageComputed" flag to false, because static objects keep their current state in memory as long as the effect DLL is loaded in Paint.NET.


There was some discussion a while ago about averaging hue values. The average of 1° and 359° should intuitively be 0° (or equivalently 360°), but a naive average calculation (summing the hues and dividing by n) returns 180°... an entirely different color. I came across a nice solution while working on my own effect, and thought I'd put together some code to share with everyone. I didn't test it much, so use at your own risk.
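The usual fix (and, judging by the description below, the approach the posted code takes) is the circular mean: map each hue onto the unit circle, average the vectors, and convert the result back to an angle. A hedged Python sketch, since the original code isn't reproduced here:

```python
import math

def mean_hue(hues_degrees):
    # Circular mean: each hue becomes a unit vector; the angle of the
    # vector sum is the mean hue.
    x = sum(math.cos(math.radians(h)) for h in hues_degrees)
    y = sum(math.sin(math.radians(h)) for h in hues_degrees)
    return math.degrees(math.atan2(y, x)) % 360

print(mean_hue([1, 359]))  # approximately 0 (or 360), not the naive 180
```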

In fact, what you're computing is the centroid of the vertices of a polygon on the unit circle: first convert the hues from polar to cartesian coordinates, then convert the centroid back to polar to get the angle.

I think you can be much faster, because the hue value in HSV is not a true angle but a position measured along the perimeter of a hexagonal color wheel. So instead of using a circle, convert the angles to positions on the hexagonal perimeter: no trigonometry (sin, cos, atan) is needed for that, just linear algebra and range tests.

Why a hexagon? Take the color cube (slightly distorted by the CIE color model) and rotate it as if it were hanging from a thin thread attached to the white corner, with the black corner vertically at the lowest altitude: lightness is then the altitude of points in this distorted cube. Rotate the cube around the thread while looking at it horizontally in the plane passing through the cube's center: the saturated colors all pass in front of your eyes as the points nearest to you, and those points lie along a hexagon in your horizontal plane.

The rotation angle of the cube is what is generally called the "hue" angle (but it is NOT the very approximate "angle" used in HSV, which is really a series of tangent values over distinct ranges of the effective angle). But you absolutely don't need to compute angles in order to project (vertically) any color point of the distorted cube onto your horizontal plane: you can use the projected positions on that plane to compute a mean point, and its position defines every color in the cube that has the same hue, namely a half plane bounded by the thread and passing through the mean point. (You cannot define that half plane if the mean point lies on the thread: the color is simply grey, and it can take any arbitrary hue value.)

None of this computation requires trigonometry, just linear transforms to compute the means you want (and you don't even need multiple ranges). However, you may need non-linear transforms to effectively distort the color cube when taking the gamma value into account: that requires logarithms or exponentials, depending on the direction of the transform (between linear-energy RGB and the CIE-based perceptual lighting model).

The usual assumption that colors form a perfect circular wheel is wrong, given that it does not respect the effective gamut of the sRGB color model, which is more restricted. In fact it is really a sort of ellipsoid (when considering 3 color pigments), notably distorted by the exponential gamma factor. (Usual display devices or rendering surfaces use a 1.2 gamma value, not 1.0, and they are further distorted by electronic adjustments that create near-black values using a small linear segment instead. If you applied the same transform to the theoretical sRGB color cube, you would get a sort of potato shape: the curved facets of a parallelepiped, plus additional facets with different curvatures created by the linear adjustments near black, still curved by the perceptual gamma.)

For this reason, I really don't like the color circle and what it produces ("hue" pseudo-angles). I much prefer the color hexagon, which is more exact yet much simpler to compute than the color circle: it is effectively what is used in the YUV color model (and the related YCbCr, based on another set of fundamental pigments), which also uses different white and black points within the color cube to align the vertical grey thread on which the distorted cube (used by the rendering device) hangs.

----

However, an exact "mean hue" function should really weight the hue values according to their significance: colors along the grey line (from black to white, as defined in the chosen rotated color cube, depending on the CIE-based color model you choose) should be weighted 0, as their hue is not significant at all (so they can be safely ignored), while colors with maximum saturation should be weighted 100%.

One simple way to compute the weight of a hue value is to first normalize S between 0% = 0 and 100% = 1, so it can be used directly as the weight in a weighted average of hue values (this works with your method that remaps angles to cartesian positions on a circle, or with a faster method that works on the hexagon directly without computing angles). You can even use the S values directly if they are already in a constant range (independent of the other color components), because in any case you'll have to sum not only the weighted coordinates of the hue points but also the weights (S), before dividing the sum of weighted coordinates by the sum of weights.
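Under that scheme, the saturation-weighted mean differs from the plain circular mean only in scaling each hue vector by S before summing. An illustrative Python sketch (the (H, S) input format is an assumption, not the posted code):

```python
import math

def weighted_mean_hue(pixels_hs):
    # pixels_hs: (hue in degrees, saturation in [0, 1]) pairs.
    # Each hue vector is scaled by its saturation, so grey pixels drop out.
    total_weight = sum(s for _, s in pixels_hs)
    if total_weight == 0:
        return None  # all pixels grey: the mean hue is undefined
    x = sum(s * math.cos(math.radians(h)) for h, s in pixels_hs)
    y = sum(s * math.sin(math.radians(h)) for h, s in pixels_hs)
    # Dividing x and y by total_weight would not change the angle,
    # so the division can be skipped entirely.
    return math.degrees(math.atan2(y, x)) % 360

# One saturated blue pixel among three greys: the greys have no influence.
print(weighted_mean_hue([(240, 1.0), (0, 0.0), (10, 0.0), (90, 0.0)]))
```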

The only case where this weighted average does not return a point is when the sum of weights is zero. That can only happen if all pixels in your sample are grey (saturation zero; note that saturation can't be negative, zero is its strict minimum), and even in that case the hue value can be arbitrary, given that the computed mean color will also be grey...

I intuitively think you'll get more meaningful results than by computing an unweighted "mean hue" from the H components alone: the number of black/grey/white points in your sample will have absolutely no influence on the computed mean hue, and the most saturated colors will play the most important role in the mean, even if there's only one pixel with a non-grey color and all the other pixels are grey.

One good test of the algorithm would be to blur an image with this HSV mean, computed over a spot with a small radius, on an image containing, for example, an antialiased polygon filled with plain blue and drawn over a plain white or black background: you should not see any random colors along the border of the polygon (caused by antialiasing pixels with very small saturation, or by the white or black background); the computed mean hue should remain blue everywhere along the polygon border. Such a defect can happen very easily with your algorithm, which is also not numerically stable.

Another test should use an antialiased thin line in near-grey colors, and an antialiased black-filled polygon, both drawn on a plain white surface: here also, you should not see any visibly unstable colors along the line or polygon borders after applying the same blur function.

If you want your effect to be usable several times on distinct regions, you'll need a "reset" or "run now" button in the parameters dialog that, when pressed, simply resets this "myLock.averageComputed" flag to false, because static objects keep their current state in memory as long as the effect DLL is loaded in Paint.NET.

Hi verdy_p,

You can reset the flag in OnSetRenderInfo()


verdy_p:

That pseudo-code is incorrect: you MUST NOT test the "firstTime" variable before getting the exclusive lock! And if you test it outside the lock (to see whether you need to acquire it), you must test it again inside the critical section.
You might want to look at the code a little closer. I do test the firstTime flag both inside and outside the critical section. The outside test is simply for efficiency.

As far as the claim that it must be static, I'm not so sure that's true. I only have a single-core system, so I can't test it myself, but I believe non-static variables defined at the class level are properly shared between different cores. If they weren't, I'd expect the average color -- which is not a static variable -- to be incorrect for any core that didn't compute it. Yet, based on comments by Neil Cassidy, I believe the color is correct. Perhaps Neil will comment on the matter.

You can download the effect yourself to confirm that as currently written it works properly when run several times for distinct regions.

[Added the next day] As I mentioned in a previous comment on this thread, I don't have a clear understanding of how multithreading is implemented on multicore or multiprocessor systems. It's possible each core has its own copy of the structure. If that's the case, the code will still work; the only difference is that each core will compute its own average color. That might be slightly slower, but even that's questionable, since it's mostly a case of replacing time spent computing the average with time spent waiting on the semaphore for another thread to compute it.


Hi MJW, hope all's well with you!

I'm pretty sure you would have to lock on the static variable on every access (or declare it as volatile) if you wanted to ensure that all threads would have the same view of it at any given time. I agree that making that variable static is unnecessary. It doesn't need to be static to ensure correctness of the computation, only private.

verdy_p, I think that making it static is an error in this case, because its fields are used to store the result of the computation, and a flag indicating whether the computation has finished. As Tim! suggests, one might just reinitialize this variable's fields in OnSetRenderInfo. Or one might use the "reset button" approach that you described. Both of these solutions require that we run some piece of code that executes exactly once every time the effect is run to fix the problem... which is exactly what this segment was intended to do in the first place. I feel that using a static variable to store the result of a computation is perhaps a questionable thing to do when the result is only intended to be used by a single instance of the effect class.

Edit: scrapped a bunch of stuff because it may or may not be correct.

Some references:

http://www.bluebytesoftware.com/blog/20 ... orNET.aspx

http://www.bluebytesoftware.com/blog/20 ... Model.aspx

http://www.bluebytesoftware.com/blog/20 ... odels.aspx


Hi Neil. The firstTime flag is declared volatile, though my reason for doing so was simply to ensure that the optimizer didn't get clever and remove the second test. Thanks for your comments on this and the references you posted. I'll read them carefully so that maybe I'll have a slightly clearer idea of how the multicore stuff works.


No problem MJW. You might be right about using volatile. Although the language specification only refers to reordering optimizations, MSDN does say that single-threaded optimizations won't be performed on volatile variables, and I suppose that from the point of view of a single thread, that particular optimization might be permitted, as risky as it seems. Some similar code is claimed to be correct in this MSDN article, namely Figure 7, which suggests that the optimization in question won't be performed... but who knows. In the course of reading about this, I've seen many different solutions to this problem which take into account subtly different assumptions.

At any rate, the memory barrier solution that I suggested earlier may or may not be sufficient to prevent the computation from occurring twice, depending on whether or not the second if-test could be removed. It might not even synchronize properly on some systems without a second memory barrier before the first if-test, and this may be inefficient. For the sake of brevity I won't get into any of the details. I've edited some of my posts above to make this a bit less confusing for readers.

So that essentially leaves volatile as the only optimized solution that's not in doubt for some reason. It's worth noting that the author of that article suggests either avoiding low-lock techniques altogether or assuming the weakest memory model possible (mark everything volatile when in doubt). After all of this confusion over a small optimization, I think I might keep this in mind!

verdy_p:
That pseudo-code is incorrect: you MUST NOT test the "firstTime" variable before getting the exclusive lock! And if you test it outside the lock (to see whether you need to acquire it), you must test it again inside the critical section.
You might want to look at the code a little closer. I do test the firstTime flag both inside and outside the critical section. The outside test is simply for efficiency.

As far as the claim that it must be static, I'm not so sure that's true. I only have a single-core system, so I can't test it myself, but I believe non-static variables defined at the class level are properly shared between different cores.

Non-static member variables will be shared ONLY if the render class is not instantiated several times. Nothing in Paint.NET guarantees that there will not be separate instances of this class, one per worker thread (even if the flag may be reset in the OnSetRenderInfo event handler). We have no clear information about how many renderer class instances will be allocated to process our effects, notably on multicore CPUs or multi-CPU systems (including distributed systems that may run on several hosts).

We lack a specification for getting a strong unique reference to a whole task processing a single source image. We have no such reference for creating per-task variables; the render object only has members for isolated worker threads...

And there's currently no mechanism to make a unique per-task object globally visible (including through proxies to synchronized remote instances).

Given that .NET is not designed to run only on a single host (the virtualized environment is independent of the host itself, which may be multiple), we are in a grey area here. We can only suppose that Paint.NET DLLs will not be used in multi-host VMs, but this may simply be wrong if Paint.NET is itself run from a virtual OS: a .NET VM in a guest OS VM running on a multi-host application server using MS Virtual PC, MS Virtual Server, Sun VirtualBox, Citrix, or other OS virtualization software. (The host OSes supporting the guest OS where Paint.NET is started may even be something other than Windows, including Linux, Mac OS X, Solaris, or BSD, possibly over a heterogeneous network, working together to offer a multi-CPU and multicore virtual environment with virtual shared memory and various levels of memory barriers for read/write order consistency...)

A thread may even start running on one host, then yield and continue on another: the VM manages the consistent transport of the thread's current state across several guest OSes (possibly themselves virtualized), guest OS processes, guest OS threads, .NET VM processes and .NET VM threads, and possibly even across other lightweight threads such as fibers and Win7 tasks, provided the code does not use "unsafe" bindings to native memory or threads/processes. As long as there's no such guarantee, all that can be done is to use a static class member variable and synchronize it within the safe (virtual) .NET memory model. (Things get even more complicated if you bind your code to native memory or binary code, which prevents the .NET environment from safely moving a .NET thread to another CPU or guest OS.)

Clearly, each worker thread that runs our render function cannot tell which task it is part of; all we can track are the references of the source and destination images, which do not really and correctly identify a distinct task made of several threads for separate ROIs. The fact that Paint.NET was designed to hide such linkage/tracking information from our render routines is, in my opinion, an API DESIGN ERROR: we should have a parameter in the Render() routine pointing to the object describing the task effectively shared by the multiple render threads, and a way to control its lifetime independently of the lifetime of the worker threads (OnSetRenderInfo is not enough, as it just tracks the initialization of several threads that are merely supposed to run in the same .NET process, which is not guaranteed).

Finally, note that the main Render() routine is meant to eventually be recompiled at run time to run on a GPU core instead of the CPU, because it would be much faster there and could benefit from more massive parallelization (typical GPUs today include about 32 parallel processing units, usable for example from DirectX, Direct3D, OpenCL, OpenGL, or nVidia's PhysX, and why not automatically from virtual .NET environments as well?). This could allow our render DLLs to perform real-time effects, including for animations or video editing (if Paint.NET adds support for multiframe images and frame numbering by synchronous timestamps or frame counters, all in the same media format, either in a file or in a real-time stream).


.NET threads will correspond to Win32 threads. The abstraction between "managed thread ID" and "native thread ID" mapping does exist, but it only matters if you are doing custom .NET hosting, e.g. SQL Server. Also, .NET is not a virtual machine or environment, so you are doing yourself a disservice by thinking that way.

As a corollary, in Paint.NET you can be sure that a managed thread will not be "migrated" among native threads, or processes. Fibers are never used anywhere.

And you're right that Paint.NET has a lot of "ambitious ambiguity" in the contract. However, at this point (5 years later) it's safe to say:

1) OnSetRenderInfo() will always be called first, and always by itself.

2) OnRender() will be called next. The first ROI is always executed by itself. After that, parallelism ensues.

Also, the token that is given to the effect is a Clone() of the one from the effect config dialog. From there, each thread used for OnRender() has its own Clone() of the token.

So, there is some "bracketing" that takes place. You can be assured that calls to OnSetRenderInfo() and OnRender() will never overlap when drawn on a timeline.

Also, in v3.5.2 I am adding an IsCancelRequested property that you can poll from within OnSetRenderInfo() or OnRender(). The guarantee at that point is that when IsCancelRequested flips to true that any "in-flight" OnRender() calls will be discarded.

In a future version, I plan to redo the effect system to provide more control over execution and progress reporting type stuff. Basically you'll be given an object to report progress through, and a simple ReadPixels() / WritePixels() interface. From there, additional classes will be provided which will take care of chunking of ROIs, parallelism, etc. But at that point those things will be optional -- great for multipass effects, and also great for effects where a different ROI "chunking" strategy is preferred.

