It is not certain whether what AMD writes there is true, because it is almost impossible to determine by testing whether the 2 halves of a 512-bit instruction are executed sequentially, in 2 clock cycles on the same execution unit, or in the same clock cycle on 2 different execution units.
Some people have attempted to test this AMD claim by measuring instruction latencies. The results have not been conclusive, but they tended to suggest that the claim is false.
Regardless of whether this AMD claim is true or false, it changes nothing for the end user.
For any relevant 512-bit instruction, there are 2 or 4 available execution units. Each 512-bit instruction is split into 2 x 256-bit micro-operations, and then either 4 or 2 such micro-operations are issued simultaneously, corresponding to the total datapath width of 1024 bits, or to the partial datapath width available for a few instructions, e.g. FMUL and FMA. This results in a throughput of 1024 bits of results per clock cycle for most instructions (512 bits for FMA/FMUL), the same as for any Intel CPU supporting AVX-512 (except for FMA/FMUL, where the throughput matches only the cheaper Xeon SKUs).
The throughput would be the same, i.e. 1024 bits per cycle, whether the AMD claim is true (when 8 x 256-bit micro-operations are executed in 2 clock cycles, the pair of micro-operations executed in the same execution unit comes from a single instruction) or false (that pair of micro-operations comes from 2 distinct instructions).
The throughput depends only on the total datapath width of 1024 bits; it does not depend on the details of the order in which the micro-operations are issued to the execution units.
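As a minimal arithmetic sketch of the point above (unit counts and widths are the ones cited in this thread, not taken from any vendor manual), the total throughput comes out the same under either pairing hypothesis, because it is just units times per-unit width:

```python
# Hypothetical numbers from the discussion: four 256-bit execution
# units, each 512-bit instruction split into two 256-bit uops.
UNITS = 4                # 256-bit execution units
UNIT_WIDTH_BITS = 256    # datapath width per unit, per cycle
UOPS_PER_512B_INSN = 2   # each 512-bit instruction -> 2 x 256-bit uops

def throughput_bits_per_cycle(units=UNITS, width=UNIT_WIDTH_BITS):
    # Each cycle, every unit can execute one 256-bit micro-operation,
    # regardless of which instruction that uop came from.
    return units * width

def insns_per_cycle(units=UNITS, uops=UOPS_PER_512B_INSN):
    # Sustained 512-bit instructions per cycle is fixed by the same ratio.
    return units / uops

print(throughput_bits_per_cycle())  # 1024 bits of results per cycle
print(insns_per_cycle())            # 2.0 (two 512-bit instructions/cycle)
```

Whether the two uops sharing a unit over two cycles come from one instruction or from two distinct ones never appears in this calculation, which is the whole argument.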
The fact that one execution unit has a datapath of 256 bits is irrelevant for the throughput of a CPU. Only the total datapath width matters.
For instance, an ARM Cortex-X4 CPU core has a datapath width of only 128 bits per execution unit. That does not mean it is slower than a consumer Intel CPU core that supports only AVX, which has a datapath width of 256 bits per execution unit.
In fact both CPU cores have the same vector FMA throughput, because they have the same total datapath width for FMA instructions of 512 bits, i.e. 4 x 128 bits for Cortex-X4 and 2 x 256 bits for a consumer Intel P-core, e.g. Raptor Cove.
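The comparison can be written out directly (again using the unit counts and widths cited in this thread as given; treat them as illustrative assumptions):

```python
# Two cores with different per-unit widths but equal total FMA
# datapath width, per the figures quoted in the discussion.
cores = {
    "Cortex-X4":   {"fma_units": 4, "unit_width_bits": 128},
    "Raptor Cove": {"fma_units": 2, "unit_width_bits": 256},
}

def fma_throughput_bits_per_cycle(core):
    # Total FMA throughput = number of FMA units * width of each unit.
    return core["fma_units"] * core["unit_width_bits"]

for name, core in cores.items():
    print(name, fma_throughput_bits_per_cycle(core))  # both 512 bits/cycle
```

4 x 128 and 2 x 256 both come out to 512 bits of FMA results per cycle, so per-unit width alone tells you nothing about which core is faster.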
Reading the documentation is not enough if you do not also think about what you read and assess whether it is correct.
Technical documentation is not usually written by the engineers who designed the device, so it frequently contains errors: either the technical writer did not understand what the designers said, or the writer attempted to synthesize or simplify the information in a way that changed its meaning.
It doesn't really matter if the two "halves" are issued in sequence or in parallel¹; either way they use 2 "slots" of execution which are therefore not available for other use — whether that other use be parallel issue, OOE or HT². To my knowledge, AVX512 code tends to be "concentrated"; there's generally not a lot of non-AVX512 code mixed in that would lead to a more even spread of resource use. If that were the case, the 2-slot approach would be less visible, but that's not really in the nature of SIMD code paths.
But at the same time, 8×256-bit units would be better than 4×512-bit, as the former would allow more throughput with non-AVX512 code. But that costs other resources (and would probably also limit achievable clocks, since increasing complexity generally strains timing…). 3 or 4 units seems to be what Intel & AMD engineers decided is the best tradeoff. All the more notable, then, that Zen4→Zen5 is not only a 256→512 width change but also a 3→4 unit increase³, even if the added unit is "only" a FADD one.
(I guess this is what you've been trying to argue all along. It hasn't been very clear. I'm not sure why you brought up load/store widths to begin with, and arguing "AMD didn't have a narrower datapath" isn't quite productive when the point seems to be "Intel had the same narrower datapath"?)
¹ the latency difference should be minor in the context of existing pipeline depth, but of course a latency difference exists. As you note, it seems not very easy to measure.
² HT is probably the least important there, though I'd also assume there are quite a few AVX512 workloads that can in fact load all cores and threads of a CPU.