So, with AMD’s earlier announcement of stacked 3D “V-Cache”, Intel is now back with HBM memory on Sapphire Rapids Xeon. It’s going to be an interesting few years seeing how this CPU-close, off-die/on-package high-speed RAM/cache works out. Last time Intel tried this (excluding Phi, different target audience) it worked, sorta, but it was on the desktop and didn’t provide a significant difference in most real-world use.
Are you referring to the eDRAM stuff on the Haswell/Broadwell/Skylake cores for Iris GPU acceleration and L4 cache?
Fast memory close to the CPU sure seems to be working for Apple in the M1… bandwidth on that chip is nuts.
Yeah, referring to the eDRAM as Intel’s previous general-purpose x86 attempt (Phi was a bit specialized).
Phi started life as a GPU (Larrabee) that sucked as a GPU, so they tried it as a compute card, for which it was… OK, if really weird.
The eDRAM isn’t terribly tightly coupled, though closer than main DRAM and certainly faster at the time.
Yeah, I’m curious to see how the Intel/AMD versions get managed and perform. I’m sure some HPC/HFT folks will love it and extract maximum use, but I’m curious how the rest of the market will end up making use of it, and whether it’ll be worth it across a large number of workloads or only for a relatively niche group that can take advantage of it.
Would you like your CPUs faster? This is how you get faster CPUs for most people…
One of the reasons the M1 is so blazingly fast is that it has stupidly large amounts of fast L1 cache: 192KB L1I and 128KB L1D on the performance cores, 128KB L1I and 64KB L1D on the efficiency cores. Oh, and 12MB of L2 shared among the performance cores, 4MB for the efficiency cores.
A recent Intel chip (i9-9900K) has 32KB L1I and 32KB L1D per core, but those are hyperthreaded cores, so cut that in half, practically. There’s a whopping 2MB of L2 total (256KB per core, vs 3MB per performance core for Apple), and 16MB of L3. A modern Intel chip can run really fast, and wait for L1 at 5GHz very, very nicely. Part of the point of all the speculative execution is to guess what should be fed into the caches by the time you need it.
A lot of fast memory should help quite a bit, though without the L1/L2 cache infrastructure to make use of it, it’s not going to change a ton.
Wow, didn’t realize the L1I/D caches were so large on the M1. I mean, I’m sure I read it, but didn’t really connect it and compare with the Intel/AMD L1s.
Do you think Intel/AMD will end up with some consumer-side mix of higher-throughput and power-efficient cores at some point?
Intel is trying with Lakefield (The Intel Lakefield Deep Dive: Everything To Know About the First x86 Hybrid CPU - Print View) and Alder Lake (Intel Alder Lake: Confirmed x86 Hybrid with Golden Cove and Gracemont for 2021)
I’ve no idea if it will be any good, and the code certainly isn’t mature on the x86 side for that, but it’s a thing.
I would love to read a report that used performance counters on a M1 and common x86 chip to see cache layer hits, but I don’t feel like doing that work myself. But remember, the L1 on the M1 is larger than the L2 on an Intel chip!
My impression is that the M1 doesn’t spend much time sitting around waiting on things: when a cycle clocks, the chip is ready, with the data it needs, and can really chew on things. With the L1 effectively split between two threads on a hyperthreaded Intel core, your L1 hit rate has to be poor.
Other than exposing the differences between cores, I don’t know what else would be needed. Linux already has support for big.LITTLE. One interesting difference is that all the ARM SoCs have equal numbers of big/little cores, but Intel is going to ship 1 big with 4 little. Maybe that will limit its usefulness.
HT can be turned off, and these days the scheduler is smart enough to leave adjacent threads idle until all the physical cores are busy. If you only load cores and not threads, you shouldn’t have any trouble with resource conflicts.
I don’t think the M1 has a substantially better hit rate, or else you’d expect it to score substantially higher against Intel chips than it does. Intel chips really suck at memory: HSW/SKL/SKX are limited to 2 loads + 1 store per cycle, and they bottleneck on AGUs. The M1 can do 3 loads + 1 store or 2 loads + 2 stores, and the latency is lower, not only in cycles but in nanoseconds. In SPECint it beats yesteryear’s Skylake++++++++ by meager margins; in most tests the difference is 5-10%, and the M1 is not always the winner. But, yes, being able to wait faster isn’t much of an advantage for Intel.
ICL also has 512KB of L2, so it is not true that the M1 has more L1 than Intel has L2.
It’s interesting to me that AMD cut its L1I to 32K in Zen 2 after 20+ years of 64K. I’m curious why they went in the opposite direction. It may be a necessary sacrifice to hit higher clocks, or they may be targeting different workloads.
I don’t think it will - the old big.LITTLE cluster switching is long since gone, with schedulers now aware of all the cores. My PineBook Pro has two big cores and four little cores, and the scheduler seems fine with it. I agree a single big core may be of limited use, but we’ll see.
The M1 is Apple’s first desktop chip, aimed at the low end, and it’s hanging with quite high-end Intel stuff. We’ll see how the M1X or M2 perform, but I expect some nice crushings…