Why would you need to do that though?
Seriously -- processing left-to-right, top-to-bottom takes the most advantage of cache locality. Any time you access a pixel in memory, the cache miss it incurs brings in the surrounding pixels as well. But "surrounding" is based on the memory layout, and for a row-major image that means horizontally adjacent pixels.
So having left-to-right as the inner loop exploits this. When you reverse the loops, you thrash your cache. Say (using (x, y) coordinates with x horizontal) you start at (0,0) and run down the column all the way to (0,1023). Every one of those accesses has brought in pixels (0,n) through (31,n) [for instance]. By the time you get to the bottom, a simple LRU eviction policy has already kicked out the top rows' cachelines. So when you start your next column at (1,0) you're back to cache misses for every pixel. Which means you're doing 8 to 32x the memory traffic, depending on your processor's cacheline size and the pixel size.
But if you access (0,0) and then (1,0) right away, you pay one cache miss and then get a cache hit: much faster.