On the counter-intuitiveness of speed

Sunday 20 July 2008This is more than 16 years old. Be careful.

I had an idea this morning that I thought would make my Mandelbrot viewer Aptus run a little bit faster. The compute engine is written in C for speed, but with a Python progress callback function passed in to get updates on the state of the computation. The code is structured like this:

// Code A: .1 sec
for each scanline:
    compute the pixels
    call the progress function

The progress function is called once per scanline, so for the default 600×600-pixel view, it is called 600 times. Computing this default view takes .1 second.

My idea was to invoke the progress function less often. There’s no need to invoke it as often as 600 times in a tenth of a second. Since the progress function is written in Python, I figured I could save some time by avoiding some of that Python execution.

As a quick test of my thesis, I commented out the progress call entirely:

// Code B: .06 sec
for each scanline:
    compute the pixels
    //call the progress function

Now the computation took .06 seconds, a significant improvement! It looks like 40% of our time is spent reporting progress through a Python function.

The basic unit of computation for the Mandelbrot set is an iteration, and I was already counting the total number of iterations. So I changed the code to call the progress function only if a minimum number of iterations had been calculated since the last progress call:

// Code C: .1 sec
for each scanline:
    compute the pixels
    if min_progress_delta (1M) exceeded:    
        call the progress function

With this code in place, the computation still took the original .1 seconds. That’s odd. The total iterations in this case is 1.9 million, so we only exceed min_progress_delta once, and the progress function is only called once. How can this be? In Code A, we invoke the progress function 600 times, and in Code C we invoke it once, and yet they take the same amount of time. In Code B, we invoke it not at all, and it speeds up by 40%. How can the one call change between Code B and Code C make such a difference?

Odder still, suppose we change min_progress_delta to two million, so that the progress function is never invoked?:

// Code D: .1 sec
for each scanline:
    compute the pixels
    if min_progress_delta (2M) exceeded:    
        call the progress function

It still takes .1 second! More experimentation: comment out the call of the uncalled progress function:

// Code E: .06 sec
for each scanline:
    compute the pixels
    if min_progress_delta (2M) exceeded:    
        //call the progress function

Now it takes .06 seconds! How can commenting out code that is never called make a difference? We’re starting to zero in on the issue here: we didn’t simply comment out uncalled code, we commented out the entire body of the if clause. And that meant that our C compiler eliminated the test of the if since it was unneeded.

It’s beginning to look like the simple act of doing anything at the bottom of the loop is taking time. That’s the only explanation for the data we have so far.

I’m no expert in these matters, but I’ve read enough about pipelines and caches to know that this is entirely plausible. When the code is uncluttered with detours, it goes much faster than when the end of the loop pauses to consider whether to invoke the progress function. Ironically, actually calling the function and invoking all of the Python overhead is insignificant compared to the time lost to simply deciding (in C!) whether to call it.

Or am I missing something here? Is there a way to invoke my callback function without putting hiccups in my pipeline?

Comments

[gravatar]
I am reminded of a macro in the linux kernel giving the compiler some guidance about the probability that a branch will be taken. It might be worth taking a look at to see if that will reduce the number of missed branches.
[gravatar]
This article should be enough to get started http://kerneltrap.org/node/4705
[gravatar]
I would take a look at the compiler's output. This should give you a pretty good clue about what's really going on there. If gnu disassembly tools are not your cup of tea, you can also use the free versions of IDA to take a look.
[gravatar]
Something looks wrong with this problem. According to your pseudo-code, the test in question only runs once per scanline, which is just 600 times per run. I can't believe 600 mis-predicted branches add up to 40 milliseconds.

After speculating for a bit, I stopped being lazy and downloaded the current Aptus source. Nice code.

I see that preparation for the progress checking call involves multiple calls to sprintf. Ouch. That has to be a performance hit. There seem to be plenty of branches in here, on each iteration, so I wouldn't expect one mis-predicted branch is going to hurt so much. (Though I do notice that the progress function might be called more than once per scanline).

Definitely something fishy going on.

Do you get similar results when you run with non-optimised and optimised compiles?
[gravatar]
Noah Kantrowitz 5:27 PM on 20 Jul 2008
Another option would be to write out status info to a block of shared memory and let another thread read it out and update the GUI. Assuming a dual+ core CPU (which is pretty common now) this should help.
[gravatar]
These are all good suggestions, thanks.

@Alan: after I posted this, I also wondered about how any code executed only 600 times could have such a huge impact. Espectially since there are plenty of other conditionals in the inner loop (once per pixel), which should have a far larger effect. I guess there's still some mystery here...
[gravatar]
I would look at the generated assembly.
At first blush from the psudo code I would guess that it is the function call which is causing the problem (not the conditional). The existence of a function call may cause the compiler not to unroll the loop, and also cause more code to be inlined (changing the inline statistics for the higher function).

What you might be seeing is a difference between inlining the the parent function and not. For our engine code base we do a fair amount of inline tweaking to deal with issues like this. That is the only thing which I can think of which could account for the .4sec.
[gravatar]
I'm no gcc wizard, so I may be doing something wrong getting the assembly files, but when I compare the assembler generated by Code D and Code E, the differences are just that there's a chunk of code missing from Code E, corresponding to the commented out source code. That is, there's no larger compiler magic like loop unrolling being affected here.

BTW: I tried likely and unlikely also, and looking at the assembler output, it looks like they have no effect. Again, I may not be generating the output properly.
[gravatar]
OK, I feel a little sheepish about this, but it turns out I was compiling with only -O. Most examples of building Python extensions with setup.py show the flags as including -O3, but mine did not. I added -O3 (with extra_compile_args in setup.py), and most things make a lot more sense now.

First, the code just runs faster now, about .07 sec for a baseline. Second, the likely/unlikely macros now generate very different code, and the code runs about 4% faster with them than without them.

But we're not out of the woods yet. Code A now runs in .07 sec, but Code B runs in .09!! That is, comment out everything to do with the progress function, and the code runs slower. WTF?
[gravatar]
It does sound like a compiler thing (when it doubt blame the compiler!). The python function call overhead isn't that high, you should be able to do 50k+ function calls in that 0.04 secs.

My compiler-foo is weak but I looked at the source and nothing jumped out at me. The inner loop contains similar sprintf/python function call logic so unless that gets optimized out completely I would think that same logic in the outer loop wouldn't be 600 times slower.

unrelated: is the (s == 1 && self->trace_boundary) test repeated twice just for style? Not that it matters for speed, that's the kind of thing compilers optimize easily.
[gravatar]
@Jack: you're talking about this code:
if (s == 0) {
    c = compute_count(self, xi, yi);
    COUNTS(xi, yi) = c;
    num_pixels++;
    STATUS(xi, yi) = s = 1;
}
else if (s == 1 && self->trace_boundary) {
    c = COUNTS(xi, yi);
}

// A pixel that's been calculated but not traced needs to be traced.
if (s == 1 && self->trace_boundary) {
    ...
The tricky part here (and I've had to rediscover it myself a couple times) is that you can execute the if in the first clause, and the result could be that s is 1, so you may execute the second if without having executed the first else-if, if you see what I mean.
[gravatar]
ah, I missed the mutation of "s" in the first if - even though I was looking for it.
[gravatar]
I suspect you've spent too much time using Python when you start calling C variable assignments "mutations" ;-)
[gravatar]
An idea for using Mandelbrot as a visual way to measure the relative performance of c vs. pypy:

Would it be possible to make a version of Aptus where your c engine calculated the Mandelbrot set from left to right while at the same time a pypy engine calculated the Mandelbrot set from right to left? If c and pypy were exactly the same speed, then they would meet in the exact center of the image. If c were faster than pypy the they would meet somewhere right of center.

This would be a cool visual way to measure the speed improvements of pypy over time.
[gravatar]
@Chris: the flaw in your plan (aside from the logistic difficulties of using two implementations to create one image) is that different pixels take different amounts of time to compute, so there's no reason to think same-speed implementation would meet in the middle. In fact, the right half of the Mandelbrot set has more black pixels than the left, so it takes much longer to compute.
[gravatar]
Ahh... You learn something new everyday.

OK... Another proposal: C calculates top to bottom and pypy calculates bottom to top. Would that be a fair race?

My thought was that the two engines would calculate the pixel values but not actually draw the pixels. If this were the case then each engine could yield pixel values back the main thread which could draw the pixels into a common image. Something like a hybrid car with two engines under one hood.

Final proposal: (for extra credit) Four pixel value calculating engines.
- Divide the image in four quadrants.
- C engine calculates the topLeft quadrant going from top to bottom
- Pypy engine calculates the bottomLeft quadrant going from bottom to top
- Python 2 engine calculates the topRight quadrant going from top to bottom
- Python 3 engine calculates the bottomRight quadrant going from bottom to top
--- Assume that the two CPython engines will be the slowest so give them the easier right half to calculate.
--- Assume that the two CPython engines will be of equals speed but test them side-by-side to verify.
--- All engines would yield pixel values back the main thread which could draw the pixels into a common image.
--- The three Python engines run the same code to keep things equal.

Add a comment:

Ignore this:
Leave this empty:
Name is required. Either email or web are required. Email won't be displayed and I won't spam you. Your web site won't be indexed by search engines.
Don't put anything here:
Leave this empty:
Comment text is Markdown.