Branch predictor: How many "if"s are too many?

What Syonyk said. To be a bit more formal, a “backwards” branch is one whose branch target is a lower address. Direct branches are essentially instructions that add a value to the instruction pointer / program counter. If the amount being added is negative, then the branch goes backwards. Usually code blocks are laid out in roughly the same order that they appear in your source; and in the source, the only way to branch towards the top of the function is with a loop.
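For a concrete picture, here’s the loop case as a minimal sketch (the function is just an illustration, not taken from the thread):

/* A simple counted loop: the compiler emits the loop body and then a
 * conditional branch back up to the top of it, i.e. a branch whose target
 * is a lower address (a negative offset added to the program counter). */
int sum(const int *a, int n)
{
    int total = 0;
    for (int i = 0; i < n; i++)
        total += a[i];            /* the backward branch closes this loop */
    return total;
}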

The answer here is very complex. It depends on your compiler and on how it predicts which way the branch is most likely to go. Microsoft’s compiler is particularly dumb and assumes that the body of the if is likely to execute. I believe I have seen GCC on occasion decide to drop the body of the if to the end of the function, favoring the not-taken case. You can help GCC with __builtin_expect(), which tells GCC what you think is the most likely value of an expression. Linux defines likely() and unlikely() macros to simplify this. However, the best way to get it to generate good code is to run your application on a good, representative data set and collect profile-guided optimization (PGO) data. GCC will use this data to rearrange the blocks so that the processor statically predicts the most likely path through the function.
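As a rough illustration of those macros (the definitions mirror the Linux kernel’s; parse_header() and its error path are made-up examples):

/* Linux-style wrappers around GCC's __builtin_expect(). The !! collapses
 * the expression to 0 or 1 before handing it to the builtin. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical hot-path function: telling GCC the error check almost never
 * fires lets it keep the common case as straight-line, fall-through code. */
int parse_header(const unsigned char *buf, unsigned len)
{
    if (unlikely(len < 4))
        return -1;                /* rare: pushed off the hot path */

    /* ... the actual parsing would go here ... */
    return buf[0];
}

For the PGO route, GCC’s -fprofile-generate / -fprofile-use pair is the usual mechanism.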

There’s a nice SO question on the likely/unlikely macros: “How do the likely/unlikely macros in the Linux kernel work and what is their benefit?” on Stack Overflow.

One thing to note is that a C expression can itself cause a branch. For example: int x = *p || *q. The C standard requires that the && and || operators short-circuit, which is to say that the right-hand expression is not evaluated if the left-hand expression is 0 in the case of && or non-zero in the case of ||. The previous example may be rewritten as int x = !!*p | !!*q; the result is the same, but *q will be evaluated even if *p is non-zero. This usually avoids a branch.
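To make that concrete, here is a minimal sketch of both spellings (the function names are mine):

/* Short-circuiting form: *q must not be evaluated when *p is non-zero,
 * which usually forces the compiler to branch around the second load. */
int any_shortcircuit(const int *p, const int *q)
{
    return *p || *q;
}

/* Branch-free form: both loads always happen and the two truth values are
 * combined with a bitwise OR. Only valid when reading *q is always safe. */
int any_branchless(const int *p, const int *q)
{
    return !!*p | !!*q;
}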

Here’s an example: https://www.godbolt.org/z/T31T6os7K. In the second version of the function I dereferenced the pointer outside of the ternary operator so that the compiler would not need to use a branch. GCC chose to use a branch anyway, so I added the inline asm to trick it. Note that the asm doesn’t actually emit any code; GCC only looks at the constraints and sees that it needs to put rval into a register.
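The godbolt example isn’t reproduced here, but the pattern described reads roughly like this (the function, its parameters, and the condition are my own invention; the empty asm is the usual GCC idiom for constraining a value into a register without emitting instructions):

/* Hypothetical reconstruction of the trick described above. The load is
 * hoisted out of the ternary so it happens unconditionally; the empty asm
 * emits no code, but its "+r" constraint makes GCC materialize rval in a
 * register, nudging it toward a conditional select instead of a branch. */
int pick(const int *p, int fallback, int use_fallback)
{
    int rval = *p;                     /* dereference outside the ternary */
    __asm__("" : "+r"(rval));          /* no instructions, only a constraint */
    return use_fallback ? fallback : rval;
}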

This is called predication. Branches cause the instruction pointer to (potentially) go somewhere other than the next instruction, and this is precisely what makes them painful for CPU designers. Predication doesn’t affect control flow; it only affects whether the current instruction executes.

It’s especially relevant on ARM because most instructions can be predicated. The cost of a branch can be modeled as: predict_cost + ((mispredict_cost - predict_cost) * mispredict_rate). If a correctly predicted branch is essentially free, the mispredict cost is 20 cycles, and the branch is mispredicted 50% of the time, then on average your branch costs 10 cycles. To this you would add the average cost of executing whichever side of the branch is taken. If the cost of executing both sides unconditionally is less than that total, then branch-free logic is faster.
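Written out as plain arithmetic (the numbers are just the example figures above, not measurements from any particular core):

/* Expected cost of a branch under the model above. With predict_cost = 0,
 * mispredict_cost = 20 and mispredict_rate = 0.5 this comes out to 10. */
double avg_branch_cost(double predict_cost, double mispredict_cost,
                       double mispredict_rate)
{
    return predict_cost + (mispredict_cost - predict_cost) * mispredict_rate;
}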

I’ve never worked with the Cortex-M line, so I don’t know much about those. They might not even do dynamic prediction since it would cause the timings to be non-deterministic.

Excellent information, thank you!

Since we run on embedded targets, I’m not aware of a way to do this - but that raises a thought. If I can generate a test harness that executes this code using captured data - on a different CPU - can the profile data be used to optimize the cross-compiled target, or is it no longer useful once the architecture changes? I’ve not messed with this much - it’s not common among my set of mentors - so I have no idea how it really works except in theory.

Are there cases where distance affects branch cost? E.g. in the example you provided the branch is only a couple of instructions ahead - but I wonder why GCC put that in there if branches cost so much; making three or four instructions take ten seems a bit terrible. Is there some thought process behind this - especially when -O3 explicitly includes things like -fif-conversion?

That is an interesting question whose answer I don’t know. If the instrumentation is just collecting counts on expressions in the source code, then it wouldn’t matter which architecture you’re on.

RISC machines generally can’t jump around the whole address space. ARM A32 gives you +/- 32MiB. Anything outside of that range will almost certainly be converted into an indirect branch. Other than that, branch distance should be irrelevant.

My guess is that some optimization pass rewrote the ternary construct and discarded the information that the compiler was free to dereference the pointer.

I very much appreciate your comments here, thank you! I’ll take some time to think about these implications as I prepare our codebase to transition from M4 to M7, which has a significant difference in branch behaviour (both cost and prediction mechanism), as well as introducing caching into the mix.