Branch predictor: How many "if"s are too many?

I love these guys and their weird corners of production optimization…

Reminds me of a blog I read recently about how “data oriented programming” is much more amenable to modern CPU caches - instead of “object oriented”, where you have Cat1.name, Cat1.sex, Cat1.color, Cat1.weight, Cat2.name …, you put all the cat names in one structure, all the weights in another, because most likely your operations will be performed on all names at once, and having them in contiguous memory makes it much easier for the cache and prefetcher to do the right thing.
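A minimal sketch of the two layouts in C (the Cat fields and sizes here are just made up for illustration):

/* Array-of-structs ("object oriented" layout): each cat's fields sit next to
   each other, so a loop that only needs weights still drags names, sexes,
   and colors through the cache. */
struct Cat {
    char  name[32];
    char  sex;
    char  color;
    float weight;
};
struct Cat cats_aos[1000];

/* Struct-of-arrays ("data oriented" layout): all the weights are contiguous,
   so a loop over weights touches only the cache lines it actually needs and
   the prefetcher sees a simple sequential stream. */
struct Cats {
    char  name[1000][32];
    char  sex[1000];
    char  color[1000];
    float weight[1000];
};
struct Cats cats_soa;

float total_weight(const struct Cats *c, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += c->weight[i];   /* sequential, prefetcher-friendly reads */
    return sum;
}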

We’re at the cresting height of a processor-fueled performance climb - the next twenty years are going to be mainly improvements in software languages and design I feel - more processing of code and data to suit the CPU instead of the other way around. I personally suspect a lot of the M1 performance is simply based on customizing the chip for the types of software that are run today; but customizing the software for what chips do well is also available (though trickier for individual programmers, likely).

That’s the fast way to do things for GPU programming as well. Put all your related data elements together as they’ll be accessed in parallel and the memory systems are really good at that.

Yeah I think the GPUs and their programming are at the forefront of where we’ll need to go on general purpose CPUs if we want to keep doing Moore’s Law impressions.

I suspect we’re at another turning point where total processing POWER isn’t the biggest issue anymore; it’s performance per watt and other considerations.

I’m pretty sure that ship has sailed, hit a reef, and is sinking. I don’t expect more than a doubling or two of performance until we hit the wall on general purpose CPUs. GPUs already practically glow.

You hit the problem of “You can’t do subatomic transistors” here pretty soon, and I’m not aware of any real way around that.

On the other hand, if the style was for leaner, more efficient uses rather than “bloat all the code with 650 nested dependencies and ship a browser with it!” that makes up modern “applications,” we might get somewhere.

I mean, fundamentally, I use a modern computer to do just about the same things I used a Pentium 3 for, and only a bit more than I used a 486 for.

Yeah, and we’re hitting other practical limits - I have something like 20 million pixels visible on my monitors and I guess could get to 40 or 80 million with retina-like display enhancements; but beyond that it really becomes pointless.

It’s the dirty secret of tech - we really haven’t had a major advancement since the Internet.

It’s sad, too, because when computers were underpowered there was a great drive to use them more efficiently (because efficiency meant you could do more) - now most everything is “good enough”, so there isn’t the same drive to increase software performance.

Hrm. How many pixels do I have in my office right now?

Looks like about 26.5M or so.

I’d argue that smartphones were a major shift - a pocket computer with always-on internet access is different enough from a laptop that I consider it a separate category of device. That was over a decade ago.

Smartphones were the impetus that caused performance per watt to become even a commonly-known term, let alone something that was considered important by more than niche players.

I think that while they and the related radio tech were definitely a substantial shift in our technological toolkit, they weren’t a massive leap in anything but energy consumption and physical space (which turned out to be quite related). There hasn’t been a massive shift from compilers suddenly solving all our problems - that’s been “10 years out” for the past 40. There hasn’t been a new language that transformed computing, though lots of fanbois might like to claim theirs has.

The Internet was, IMO, more of a social-relationship-to-computing change than a major advancement of technology. We switched from in-person/landline bidirectional and print/radio/TV one-way to cellphone/online two-way communications via the concepts incorporated into the Internet. The idea of routing messages wasn’t new or transformative, the idea of digital communications wasn’t new or transformative, but the synthesis of these concepts applied to the burgeoning corporatization of life hit a nerve psychologically in the West in particular, and because of that people poured immense amounts of energy into it.

And, like NASA, the need for more, bigger, smarter, faster, etc. pushed boundaries and drove the development of technology, so in a sense there were knock-on effects from it. But I don’t see that as one collective innovation so much as the result of a massive, sudden, and nearly universal perspective change by humans. Or, to phrase it another way: if humans hadn’t latched onto the idea that communicating with this technological tool was suddenly of utmost importance, the tech itself might have ended up there anyway, just much more slowly, with a very different end result, and perhaps far fewer users. So, yes, I agree we haven’t had a major tech advancement since the Internet, but I don’t consider the Internet a “tech advancement” so much as the accelerated impact of humanity taking a sharp left turn at speed. That collapsed what would have been decades or more of incremental and not-necessarily-valuable work into a short timeframe, making it look more important and connected than it really is.

We’ve hit saturation in so many metrics - every Ethernet upgrade for me up to 1G was really noticeable, but the 10Gb backbone I have now didn’t really “do” anything. Once my internet was fast enough for decent-quality YouTube streams, I stopped noticing “faster”.

Same thing with storage - with 10TB on a NAS I’ve not had to worry about it anymore. Same with phone speed, camera, and storage there, too. Wifi speeds have increased and I’ve honestly not noticed.

I’m not sure what will drive my next upgrade, if anything.

From my perspective JavaScript has transformed computing. I’m not sure it’s a major /advance/ since the internet, but having a JS JIT nearly everywhere has gone a long way towards not caring about memory usage and not even knowing how code actually gets executed… Or what code your code really depends on. Or having a clue what security holes it had… Most of that isn’t because of the language but because of where the language was placed. Makefiles and C in every browser could have done something similar, but they didn’t.

I do see the internet as a major advance (with costs), although perhaps it is better described as many distinct changes, some of which are advances. Partly I speak of an internet I’ve only read about, but I’m pretty sure I picture something different from the average American or even the average American programmer.

Being able to share documents, programs, ideas, conversations across the country or the world, overnight, with as many people as are interested (and have access), for cheap enough no one bothers tracking each copy… That’s big. Partly that’s the connectedness of the internet. Partly that’s being able to send without destroying the original.

Later being able to find information at 1999 speeds instead of 1991 speeds was another big deal, although part of that might have been my access and age across a decade. Books had depth and breadth. Research articles had depth and were suddenly available to random high schoolers, not just someone with a university library nearby.

I think the Pentium 4 laptops also convinced people that there might be something to this performance per watt thing, at least in terms of laptops that didn’t try to light your lap on fire and melt out the solder in their power pins. It was followed by the Pentium M and such which turned into the Core/Core 2 line, and they were far better. Saved Intel, even. Now, if you want some screwed up frontends and pipelines, the Pentium 4s were good fun.

Indeed. This is a thing that a lot of people miss, willfully or not. At some point, there’s a limit in that you’re interfacing with humans. Going to retina displays, yeah. There was a difference. Doubling the pixel count off that? I don’t think I could practically see a change. And technologies end up leveling off as you hit either the limits of what you can do, or the limits of what makes any sense (or that anyone can afford). Look at the costs for new technology IC processes - they’re insane, and an industry that started with “neighborhood chip foundries” (not quite, but pretty close) has consolidated to only a few global companies because the costs are so high.

End of software support for something or other, but even then, most of the major transitions are past. I figure my 2020 iPhone SE (finally broke down and bought one after my 6S was acting up again) will have 6-8 years of software support with a mid-life battery refresh.

This is basically my point: languages and applications like JS and the browser haven’t changed “computing” at all - they’ve just increased the number of situations we’ve found a way to apply it to. There’s nothing at all novel or revolutionary about either tech, they’re just another iteration and spreading application of things we already had.

Well this is the social thing I was talking about. Again, the fact that computing enabled this doesn’t mean it changed computing itself. Just another rapid uptake of an existing technology once it became clear that the technology could be applied to that situation.

Well I don’t know of any case where sending destroys an original. Usually sending involved taking the original physically and moving it to another place, or manually copying it into a new “original”. The digitization of this process in no way makes it novel, either - people transcribed information by radio before the computer, by mimeograph before that, by Morse and lights before that, etc. The only thing that changed was the scope of the communications network and the mechanism by which it operated, both of which sped it up.

I think we largely agree, in that the changes that were impactful on our society had to do with our relationship to technology, where we applied it, and what we expected of it (e.g., “social” changes) rather than fundamental breakthroughs in the technology itself (which, basically as soon as the transistor was recognized as a useful switch, was a given, since the technology for computing with switches was already in existence at that point and had been for nearly a century – or since antiquity if you draw the line somewhat fuzzily at the logical foundation for the idea).

So to me the big technological advances related to computing were:

  1. Boolean algebra (and the subsequent derivation of that into mathematical operations using binary logic). This was the first and fundamental leap - to be able to apply an algebra to discrete-state events.
  2. The transistor (and subsequent derivations of the various gate forms and manufacturing processes that enabled waferization of circuits, etc, all of which I view as incremental, not revolutionary). We had binary computers before the transistor, but this made it possible to scale them.
  3. Sampling theory (and subsequent derivation of that to discrete-time mathematics and signals processing). This was the last necessary leap: to apply real-world signals to the discrete-time nature of fixed-step computing in a useful manner.

On the backs of these three fundamental developments pretty much all of modern computing and its application rests.

You can change materials, such as to GaN. (This is very hard.)

You can also make optimizations to the transistors, such as 3D transistors, which are currently being teased. This has its own set of issues, but it reduces prop delays between gates.

Then there are all sorts of materials optimizations that don’t involve shrinkage.

So, yeah, we’re obviously approaching the size limit, and Moore’s Law is completely dead, but there are still a few more possibilities for eking out more small gains in transistors.

I’m pretty sure they’re full of crap. This is one of those “you know enough to be dangerous” situations.

AFAIK the BTB never caches branch targets for direct, unconditional branches. This would be idiotic since they go to the same place every time. The BTB does predict the target of an indirect, unconditional branch, and it predicts direction for direct, conditional branches. The x86 machine does not have indirect, conditional branches, though other architectures do.

Also, there is a separate return stack predictor that has existed in x86 chips for at least 15 years if not more. I don’t think the target of a RET instruction is ever saved in the BTB. This wouldn’t even make sense since it’s often unpredictable. (Where is malloc() likely to return to?) Instead, the return stack predictor predicts that the return will branch to the instruction after the most recent CALL in its LIFO stack.
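If it helps make that concrete, the concept is tiny - here’s a toy C model of a return address stack predictor (my own sketch, not any vendor’s actual implementation):

/* Toy return address stack (RAS) predictor: on a CALL, push the address of
   the next instruction; on a RET, pop the top entry and predict it as the
   return target. Real hardware handles overflow and corruption more cleverly. */
#define RAS_DEPTH 16

static unsigned long ras[RAS_DEPTH];
static int ras_top;   /* index of the next free slot, wraps around */

void on_call(unsigned long return_addr) {       /* address just after the CALL */
    ras[ras_top] = return_addr;
    ras_top = (ras_top + 1) % RAS_DEPTH;        /* oldest entries get overwritten */
}

unsigned long predict_ret(void) {
    ras_top = (ras_top - 1 + RAS_DEPTH) % RAS_DEPTH;
    return ras[ras_top];   /* instruction after the most recent unreturned CALL */
}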

They also seem unaware that the BTB has two modes: static prediction (no history) vs. dynamic prediction, and they don’t seem to be taking static prediction into account. If your branches are highly predictable but statically mispredicted, then you will see much higher latency on the first execution as the BTB is populated.

I will have to think a bit on their tests, but I’m pretty sure every last one of them is poorly constructed and isn’t measuring what they think they’re measuring.

Is this something that might be easier to tease out on a simpler ARM core with a basic branch predictor, like an M7 or something? I’m just getting my hands dirty with that chip and I’ve been quite curious what its (I assume rudimentary) branch predictor does. If you’ve got ideas for writing better tests I may have some time to try them out on that chip, but at the moment this is all fairly new to me - I’ve known for a long time that branch predictors exist, what they attempt to do, and various basic principles around how they might go about it, but I have no idea how to surface the details in a way that strips off the “magic” to make it more understandable.

Since I do realtime stuff, this would be helpful for me to learn, as well.

The details of branch prediction are much like the details of caches. The high-level concepts are the same, but the hash functions aren’t typically publicly disclosed. You might be able to suss out some details from patent filings if you’re lucky.

As far as I am aware, the general rules are the same for everyone – across architectures, vendors, etc. There are two basic kinds of predictions: predicting whether a conditional branch will be taken or not and predicting the target of an indirect branch.

Conditional branches are always statically predicted taken if backwards and not taken if forwards. This is what happens when there is no history for the branch. You can get measurable gains just from organizing your code so that static prediction favors the more likely side. After that, dynamic prediction will take over. The original algorithm was a 2-bit counter, but nowadays it’s rather complex. Intel can detect and predict a loop with a fixed number of iterations up to 16. Keep in mind that (mostly) every branch has separate history, so sprinkling your code with

if (debug) { … }

could result in a lot of extra mispredicts, which adds up to a lot of wasted cycles. Worse, the branch prediction data you evict may cause other branches to mispredict more often, compounding the performance loss. Some x86 chips like the K8 had limits on branch density. I don’t know if this continues to be the case.
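One concrete way to lean on the static rule (GCC/Clang-specific; the function here is just a made-up example) is to tell the compiler which side is unlikely, so it lays the cold path out as a forward, not-taken branch:

#include <stdio.h>

/* GCC/Clang hint: the condition is almost never true, so the compiler can move
   the error path out of line and leave the hot path as the fall-through - i.e.
   a forward branch that is not taken, matching the static prediction rule. */
#define unlikely(x) __builtin_expect(!!(x), 0)

int process(int value) {
    if (unlikely(value < 0)) {
        fprintf(stderr, "bad value\n");   /* cold path, rarely executed */
        return -1;
    }
    return value * 2;                     /* hot path falls straight through */
}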

Intel chips statically predict indirect branches as not taken. This is rarely correct, but there is simply no way to guess where these will go. Indirect branches therefore always statically mispredict. Originally the BTB would simply track the most recent branch target, though this predictor has also evolved. I believe, though am not certain, that they can now track multiple targets for multiple branches.

Indirect branches show up in dynamic dispatch, such as in dynamic linking or in polymorphism, like C++ classes. Intel’s optimization manuals discuss various techniques to improve performance, most of which I expect would apply to every architecture and every vendor. It’s also worth reading about computed goto. Especially with ARM it can be beneficial to entirely eliminate branches and completely avoid consuming BPU resources, but branch-free logic is a complex topic.
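As a quick illustration of the computed goto idea (this uses the GCC/Clang labels-as-values extension, not standard C, and the toy opcodes are made up): a bytecode dispatch loop written this way gives every handler its own indirect branch site, so the predictor gets per-site history instead of one frequently-mispredicting jump at the top of a switch.

/* Toy interpreter: opcode 0 = halt, 1 = increment, 2 = decrement. */
int run(const unsigned char *code) {
    static void *handlers[] = { &&op_halt, &&op_inc, &&op_dec };
    int acc = 0;

    goto *handlers[*code++];       /* initial dispatch */

op_inc:
    acc++;
    goto *handlers[*code++];       /* separate indirect branch, separate history */
op_dec:
    acc--;
    goto *handlers[*code++];
op_halt:
    return acc;
}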

Ok, first off, wow, thanks for this. It’s given me a lot to think about. I hope you don’t mind a few questions.

What, exactly, do you mean by “if backwards”? How is a backwards branch generated in normal code that isn’t a “goto”? (I avoid writing with gotos, personally, though I do know some cases where they are an elegant way to handle certain return situations - I tend to just factor my code so things like that are not necessary.)

Also, regarding favouring not taking forward branches, in a typical

if (foo) {
  ... foo handling
}
... other code

Is running foo handling taking or not taking the branch? Or do I need to look at the generated ASM to figure out what the compiler did with it? Early return style is HEAVY in our codebase, and we try to keep if and loop nesting to 2 deep or less whenever possible. This means we face a lot of branches I need to consider for this.

Also… ARM has a few “conditional execute” flags for instructions, which are often subbed in place of a short branch in the same code line (ternaries can frequently be resolved this way, as can simple if(foo) { x++; } else { x--; } type things). I assume these don’t involve the branch unit; please correct me if I’m wrong here.

What is it about ARM that makes this an ‘especially’? I haven’t noticed any serious performance hit from having normal code with branches - other than the fact that branches tend to reliably cost me 3-5x what a typical single line does, whether it’s an M4 or an M7, on average, YMMV etc. But that’s to be expected with a pipeline clear, PC flush, and usually an implicit DSB or similar.

For me, doing hard realtime, as long as I can put a guaranteed upper bound on what I do, the actual cost is more an optimization than a serious concern, and branches have an upper bound on the M7 (if you’re running from TCM and/or the CPU has priority on the AXI SRAM), so I’m good with that.

I’ll add that quite often this gets compiled into branching by NOT taking the IF, in my code at least - IOW if you skip handling foo, you branch to get to the rest of it.

Loops. They’re almost always ended with a “If the ending condition isn’t met, go back to the start” sort of sequence, and if you’re using loop constructs, it’s a good bet they’ll loop at least once.
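For example, a plain counted loop in C (hand-waving the exact instructions, since it depends on the compiler):

/* The compiler typically puts the test at the bottom of the loop, so every
   iteration ends with a conditional branch backwards to the loop top
   (something like "cmp r2, r3 ; blt .loop" on ARM). That branch is taken on
   every iteration except the last - exactly what "backwards = predict taken"
   assumes. */
int sum(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}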

Those aren’t branches - they’re just conditional instructions and won’t involve the branch predictor at all.
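For reference, this is the sort of thing that usually comes out as conditional instructions (IT blocks on Thumb-2, CSEL on AArch64) rather than a branch - though whether it actually does depends on the compiler and optimization level, so check the disassembly. The functions are just made-up examples:

/* Likely to become a conditional select / conditionally executed add,
   with no branch and therefore no predictor involvement. */
int adjust(int x, int up) {
    return up ? x + 1 : x - 1;
}

/* The branchy spelling of the same logic; compilers will often turn this
   into the same branch-free code, but it isn't guaranteed. */
int adjust_branchy(int x, int up) {
    if (up)
        x++;
    else
        x--;
    return x;
}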

Doh! Didn’t even think of that case. Thanks.