Rick Brewster

Performance figures for 64-bit Paint.NET v2.6 (now w/ video)


New, Jan. 19th 2006: I've made a video that clearly shows how much faster 64-bit is for Gaussian Blur. http://www.eecs.wsu.edu/paint.net/misc/ ... vs_x64.zip (fyi, it's a WMV)

Just thought I'd give a preview of what to expect from Paint.NET v2.6 in the performance department.

The good

This benchmark was performed on an Athlon 64 2800+ underclocked to 900 MHz (default clock is 1.8 GHz) running Windows Server 2003 x64 Edition. It's underclocked because it's my personal server and I want it to use less power and run cooler (it's undervolted as well); it's generally disk-bound, not compute-bound, so the halved clock speed doesn't affect what it does much. It does have 2 GB of RAM, but only because I had some spare DIMMs that wouldn't work in my other system.

I opened a 1600x1200 image and performed a 100-pixel radius Gaussian blur. To do this correctly:

1. Open the image

2. Effects -> Blurs -> Gaussian Blur

3. Type 100

4. Hit OK

5. Press Cancel immediately

6. Press Ctrl+F

Steps 5 and 6 are necessary because the rendered 'preview' is retained so that it isn't recomputed when the user hits OK, which can save a lot of time compared to v2.1. Adding these extra steps ensures that the entire effect rendering is benchmarked.

The stopwatch was started at step 6 and stopped when the progress dialog disappeared.

Paint.NET v2.5, 32-bit on .NET 1.1 -- 3 minutes 14 seconds

Paint.NET v2.6, 32-bit on .NET 2.0 -- 2 minutes 30 seconds

Paint.NET v2.6, 64-bit on .NET 2.0 -- 1 minute 2 seconds

Very nice. Not only has 32-bit performance improved for an operation that uses a lot of 64-bit math, but the x64 results are fantastic. 3x the performance is great!

The bad

Startup takes longer, with about 4-5x the number of page faults. (Edit: The number for page faults should only be 1.5-2x here, actually. I did not have the ngen.exe compilation set up correctly, so these results were for just-in-time compilation.) Memory usage is about 20,000 K higher. I'm looking into this, but it may not be fixable right now. I am told that the 64-bit .NET Framework 2.0 is currently optimized for server workloads and that client optimization will come in a future release or service pack. This probably won't be a big deal, since most 64-bit systems are more modern, with higher clock speeds and more memory than their average 32-bit counterparts.

Oh, and startup is only slower for the 64-bit version. 32-bit startup is about the same.


Both.

The code for Gaussian Blur hasn't changed, but a lot of the other stuff will be changing. That said, it's not like it'll be taking 100% advantage of the new .NET 2.0 stuff.


Also, all the icons in the UI have been redone so they look nicer. They're more Office 2003-style, whereas v2.5 and before were more Win95-style.

I am told that the 64-bit .NET Framework 2.0 is currently optimized for server workloads and that client optimization will come in a future release or service pack. This probably won't be a big deal, since most 64-bit systems are more modern, with higher clock speeds and more memory than their average 32-bit counterparts.

Out of curiosity, do you think the entire OS is the same way, or is it simply that most ported apps (games, at least) don't need the extra capabilities at all?

Out of curiosity, do you think the entire OS is the same way, or is it simply that most ported apps (games, at least) don't need the extra capabilities at all?

I'm sure the server OS is optimized for server workloads, and the workstation release is optimized for workstation workloads (it's always been that way).

With regard to benefiting from 64-bit, it's a double-edged sword. At the assembly language level you get more registers to work with, and they're all twice the size. So any app that is doing 64-bit mathematics will benefit, as will any app that has a certain level of register pressure. In 32-bit mode you have 8 general-purpose registers, only about 6 of them freely usable, and they're only 32 bits wide. In 64-bit mode you have 16 general-purpose 64-bit registers. So if you want to add two 64-bit numbers in 32-bit mode you have to shuffle things in and out and around a lot, whereas in 64-bit mode you just do it.

However, 64-bit executables are larger by about 25-50%. The instructions just take up more space, which means several things. Programs take up more disk space (not a very big deal -- 50 cents a gig right now, right?), but they also take more time to load because of it (boo). Then they take up more memory, and more memory bandwidth while the CPU is executing the instructions. The latter part is key -- even if you're getting more computational work done faster, you're stumbling over having to load more instructions out of memory sooner and more often. And memory is very slow.

In the server arena you have a huge benefit: you can work with huge amounts of memory. If I have a 14 GB database I can keep the whole thing in memory on a 64-bit system with enough RAM. In 32-bit I can only load 2 GB of that at one time if I'm really lucky. And since memory is orders of magnitude faster than disk, it's a huge win.

So yes, no, maybe. I would expect that, on average, applications will about break even on performance when moving from 32-bit to "native" 64-bit without any extra optimization. Then there will be specific applications, such as games and imaging apps (like Paint.NET!), where the developers spend extra time optimizing specifically for 64-bit, and that is where you'll see some impressive performance increases. Remember that games are always optimized specifically for their target platform -- currently that's 32-bit Windows. No doubt developers will soon turn their eyes to 64-bit and how to make games perform even better on it. But just flipping a switch and recompiling for 64-bit won't usually gain you any performance (like I said, most apps will probably just break even).


All the x64 game ports seem to load faster, though. I don't know what that's about. I used to have a theory, but it seems stupid now that I think about it.

I assume what you speak of with the registers is only 'available' ones, not the ones held by the OS... I'm not a programmer, just a techie, so I'm not exactly clear on that sort of thing. I do know 32-bit has 8 and 64-bit has 16, but hard numbers don't mean anything if you don't actually know how they're used :)

I do understand development in general, though, and I would expect the gains won't be too visible until more of the Windows environment (MSXML, DirectX, and whatnot) itself is further optimized for 64-bit, since just about everything ties into certain tools.


The main thing to know about the registers is that while the processor gives you 8 of them (or 4 or 16 or whatever), you can use as many variables as you want in your actual C or C++ or C# or whatever code. The compiler only gets those 8 registers to work with though, so once you use that 9th variable it has to swap things between registers and memory. You could think of the registers as almost an "L0" cache -- it's critically fast, but critically small. Just like how the L1 cache is much faster and smaller than L2 which in turn is much faster but smaller than main system memory (which in turn is much faster and much smaller than the hard disk).

And the 64-bit arithmetic is really key. In 32-bit mode each 64-bit value has to use 2 registers. Since the Gaussian Blur code in Paint.NET uses 64-bit values for all of its accumulation variables, the performance really takes off in 64-bit mode.
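As a rough illustration (this is not Paint.NET's actual blur code), here is what a blur-style inner loop with a 64-bit accumulator looks like in C. On 32-bit x86 that `uint64_t` costs a register pair and extra instructions per add; on x64 it's a single register and a single add:

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative sketch only -- not Paint.NET's actual blur code.
   One horizontal box-blur pass over a row of 8-bit samples, keeping
   the running sum in a 64-bit accumulation variable, analogous to
   the accumulators in the Gaussian Blur inner loop. */
void box_blur_row(const uint8_t *src, uint8_t *dst, size_t width, size_t radius)
{
    for (size_t x = 0; x < width; ++x) {
        uint64_t sum = 0;    /* 64-bit accumulator */
        uint64_t count = 0;
        size_t lo = (x >= radius) ? x - radius : 0;
        size_t hi = (x + radius < width) ? x + radius : width - 1;
        for (size_t i = lo; i <= hi; ++i) {
            sum += src[i];   /* one add on x64; add/adc pair on x86-32 */
            ++count;
        }
        dst[x] = (uint8_t)(sum / count);
    }
}
```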

We actually have a fuller benchmark suite that exercises many other aspects of Paint.NET's code, and we're thinking of releasing it. Should be pretty cool :)


To extend upon what Rick has said.

To a large extent, the performance of highly algorithmic code such as Paint.NET's filters is roughly proportional to the size in instructions of the hottest of the hot loops in the algorithm(1), and to some extent to its size in bytes. If you are trying to do 64-bit math on a 32-bit x86 chip, you will frequently end up not only with significant register pressure (which causes you to go to cache more often) but also with significantly more instructions. One of the primary reasons is that the chip requires two registers to hold a single 64-bit value. It is even worse than that: most of the instructions that take 64-bit operands require that the same two 32-bit registers be used, so a ton of instructions end up wasted moving things around from register to stack, from stack to register, etc., and the hot inner loops often end up bigger.

Depending on the memory access characteristics of the algorithm, the size of the inner loop may not end up being the defining characteristic of its performance, but it is frequently a good place to start when trying to understand these algorithms.

Footnote:

1) There are particular cases where more instructions can be faster (sometimes significantly so), for instance when getting around the processor's abysmally slow idiv instruction. However, these aren't the norm, and getting around them requires highly specialized code that works only if you can properly constrain your inputs.

Footnote:

1) There are particular cases where more instructions can be faster (sometimes significantly so), for instance when getting around the processor's abysmally slow idiv instruction. However, these aren't the norm, and getting around them requires highly specialized code that works only if you can properly constrain your inputs.

To put this into perspective for everyone else, consider that integer division is a very slow operation (Josh used the adjective "abysmally" which is more correct :)). For an Athlon 64, it takes 42 cycles to divide a 32-bit integer, or 72 cycles for a 64-bit integer. For comparison, operations like addition, subtraction, multiplication, and shifting all take 1 cycle.

For our layer composition engine, which is the code that handles blending of layers ("Blend Mode:" in Layer Properties), we have to do 3 divisions per pixel in order to have accurate blending. An 800x600 image has 480,000 pixels. Multiplied by 3, then 42 = slow. For v2.6 we have a brand new layer composition engine which implements an optimization that hides a lot of the cost of the integer division.

It takes many more instructions to execute but ends up being 80-100% faster (iirc). It involves retrieving 3 values from a lookup table and then doing a multiply, an add, and a shift-right operation instead of a division. So n / d = ((n * M) + A) >> S, but the restriction is that d can only be in the range of 0 to 255. The table lookup takes a while, but the multiply, add, and shift operations are all documented to be 1 cycle each. This technique is described and implemented in the AMD PDF titled "Software Optimization Guide for AMD64 Processors," section 8.8, "Derivation of Algorithm, Multiplier, and Shift Factor for Integer Division by Constants."

So in summary, you turn this code:

y = n / d

In to this code:

i = d * 3;
M = t[i];
A = t[i + 1];
S = t[i + 2];
y = ((n * M) + A) >> S;

Where 't' is a table containing 768 32-bit unsigned integer values. Now, remember where I said we have to do 3 integer divisions per pixel? Well, luckily we only have to do the table lookup once per pixel (the "M =..., A=..., S=..." portion), so that cost is amortized a bit.

Also, this has to be implemented using 64-bit arithmetic, otherwise the "n * M + A" part would overflow and ruin everything. So here again there's a boost on the 64-bit front. It's faster on 32-bit, yes, but 64-bit really takes off.
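A runnable sketch of the same trick, simplified from the table described above: here a single fixed shift (S = 24) and addend (A = 0) are used for every divisor, which happens to be exact for n up to 65535 and d from 1 to 255 -- the range needed for 8-bit blending, where n is at most 255 * 255. The actual Paint.NET table stores per-divisor M, A, and S values as described in the AMD guide, so treat this as an illustration of the idea, not the real implementation:

```c
#include <stdint.h>

/* Simplified multiply/shift division: y = n / d with no idiv. */
enum { SHIFT = 24 };

static uint32_t M_table[256]; /* per-divisor multiplier */

void build_table(void)
{
    for (uint32_t d = 1; d < 256; ++d) {
        /* M = ceil(2^SHIFT / d) */
        M_table[d] = (((uint32_t)1 << SHIFT) + d - 1) / d;
    }
}

/* One multiply and one shift instead of a 40+ cycle idiv. The product
   needs 64-bit arithmetic, or n * M would overflow 32 bits -- the same
   reason this trick benefits further from running in 64-bit mode. */
uint32_t div_by_table(uint32_t n, uint32_t d)
{
    return (uint32_t)(((uint64_t)n * M_table[d]) >> SHIFT);
}
```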


SSE2 is great for integer ops, sure, but it would not be practical for us to use it. It would require placing some of our code into a "native" DLL written in C++ with inline assembly. Any time we used that code we would incur a managed->native transition, which can eat up performance. Maintenance would also be very difficult, as assembly language is not very malleable. It would be more difficult to port Paint.NET (the 64-bit C++ compiler, for instance, does not support inline assembly at this time). It would also require having multiple versions of the code: one for CPUs that have SSE2 and one for CPUs that do not. I did this on a previous project, and while it's fun to learn about and work with SSE2 and assembly language, it becomes a maintenance nightmare very quickly. I actually had up to 4 versions of functions on that project (normal, MMX, SSE, and SSE2 versions).

Right now we're actually using the C++ preprocessor on a C# source code file laden with macros to generate all the code for our blend modes. This has saved us an incredible amount of time because the code for each blend mode is very similar. I can simply write code like the following:

#define SCREEN(A, B, r) \
{ \
   INT_SCALE((B), (A), r); \
   r = (B) + (A) - r; \
}

DEFINE_STANDARD_OP(Screen, SCREEN)

And then I get about 400 lines of code generated for me. Multiply that by the number of blend ops we have. Then, if you figure in how much code (at least 2x) and time (at least 5x) would be required to optimize it for SSE2, you quickly come to the conclusion that it's just not worth it. Plus, we'd probably only see about a 10% improvement anyway.
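For reference, the Screen blend op that the SCREEN macro expands into can be sketched in plain C. INT_SCALE's exact definition isn't shown in this thread; it is assumed here to compute (a * b) / 255, written with a plain integer division rather than Paint.NET's optimized lookup-table form:

```c
#include <stdint.h>

/* Sketch of the Screen blend mode (result = A + B - A*B/255).
   INT_SCALE is assumed to be the (a * b) / 255 scaling step; the real
   macro's definition isn't shown in the thread. */
uint8_t screen_blend(uint8_t a, uint8_t b)
{
    uint32_t scaled = ((uint32_t)a * b) / 255;  /* INT_SCALE((B), (A), r) */
    return (uint8_t)((uint32_t)b + a - scaled); /* r = (B) + (A) - r     */
}
```

Note how the math stays within integers end to end: screening with black (a = 0) leaves a pixel unchanged, and screening with white (a = 255) always yields white.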

I don't know if .NET will compile MSIL to take advantage of SSE2. I'll have to ask around on that.

Nowadays we're getting more of a benefit by concentrating our efforts on thread-level parallelism (TLP) instead of instruction-level parallelism (ILP). In other words, a dual-processor or dual-core (e.g., Athlon X2) system runs Paint.NET very fast.


Yeah, I can see that four versions would be a pain in the < no swearing >. I read that inline MMX is very difficult, but I really know very little about SSE1/2/3.

I can see your point, and I'd guess the managed code does translate to SSE2... but I wouldn't honestly know.


What ranges do you get in the division code? Just 8-bit, or up to 16? Because why not (unless it would overflow) break up the 32-bit so you don't have to work with all those zeros? Like instead of 000000000000000011111111 for 255 in 32-bit, 0000000011111111 with it broken down to 16-bit. Of course I have no idea what the range of numbers in that equation is. 1111111000000001 is 255 squared and a 16-bit number (but still room for 510 more, decimal). I think I have no idea what I am talking about, so enlighten me.


Hmm, I thought I posted this question earlier, but I can't find it...

How come PNG doesn't get a compression slider like JPEG? I mean, that's been true in every program I've seen, so I'd think there would be a good reason; I just don't know it.


PNG doesn't have a slider because there's no quality setting to trade off: PNG compression is lossless. Even Photoshop doesn't ask you for any settings when you save as a PNG.

