It is not certain whether what AMD writes there is true, because it is almost impossible to determine by testing whether the 2 halves of a 512-bit instruction are executed sequentially, in 2 clock cycles on the same execution unit, or in the same clock cycle on 2 different execution units.
Some people have attempted to test this AMD claim by measuring instruction latencies. The results have not been conclusive, but they tended to suggest that the claim is false.
Regardless of whether this AMD claim is true or false, it changes nothing for the end user.
For any relevant 512-bit instruction, there are 2 or 4 available execution units. Each 512-bit instruction is split into 2 x 256-bit micro-operations, and then either 4 or 2 such micro-operations are issued simultaneously, corresponding to the total datapath width of 1024 bits, or to the partial datapath width available for a few instructions, e.g. FMUL and FMA. This results in a throughput of 1024 bits of results per clock cycle for most instructions (512 bits for FMA/FMUL), the same as for any Intel CPU supporting AVX-512 (except for FMA/FMUL, where the throughput matches only the cheaper Xeon SKUs).
The throughput would be the same, i.e. 1024 bits per cycle, whether the AMD claim is true (when 8 x 256-bit micro-operations are executed in 2 clock cycles, the pair of micro-operations executed in the same execution unit comes from a single instruction) or false (that pair of micro-operations comes from 2 distinct instructions).
The throughput depends only on the total datapath width of 1024 bits; it does not depend on the details of the order in which the micro-operations are issued to the execution units.
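As a minimal arithmetic sketch of the point above (unit counts and widths are the ones cited in this thread, not taken from any vendor manual), the total throughput comes out the same under either pairing hypothesis, because it is just units times per-unit width:

```python
# Hypothetical numbers from the discussion: four 256-bit execution
# units, each 512-bit instruction split into two 256-bit uops.
UNITS = 4                # 256-bit execution units
UNIT_WIDTH_BITS = 256    # datapath width per unit, per cycle
UOPS_PER_512B_INSN = 2   # each 512-bit instruction -> 2 x 256-bit uops

def throughput_bits_per_cycle(units=UNITS, width=UNIT_WIDTH_BITS):
    # Each cycle, every unit can execute one 256-bit micro-operation,
    # regardless of which instruction that uop came from.
    return units * width

def insns_per_cycle(units=UNITS, uops=UOPS_PER_512B_INSN):
    # Sustained 512-bit instructions per cycle is fixed by the same ratio.
    return units / uops

print(throughput_bits_per_cycle())  # 1024 bits of results per cycle
print(insns_per_cycle())            # 2.0 (two 512-bit instructions/cycle)
```

Whether the two uops sharing a unit over two cycles come from one instruction or from two distinct ones never appears in this calculation, which is the whole argument.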
The fact that one execution unit has a datapath of 256 bits is irrelevant for the throughput of a CPU. Only the total datapath width matters.
For instance, an ARM Cortex-X4 CPU core has a datapath width of only 128 bits per execution unit. That does not mean it is slower than a consumer Intel CPU core that supports only AVX, which has a datapath width of 256 bits per execution unit.
In fact both CPU cores have the same vector FMA throughput, because they have the same total datapath width for FMA instructions of 512 bits, i.e. 4 x 128 bits for Cortex-X4 and 2 x 256 bits for a consumer Intel P-core, e.g. Raptor Cove.
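The comparison can be written out directly (again using the unit counts and widths cited in this thread as given; treat them as illustrative assumptions):

```python
# Two cores with different per-unit widths but equal total FMA
# datapath width, per the figures quoted in the discussion.
cores = {
    "Cortex-X4":   {"fma_units": 4, "unit_width_bits": 128},
    "Raptor Cove": {"fma_units": 2, "unit_width_bits": 256},
}

def fma_throughput_bits_per_cycle(core):
    # Total FMA throughput = number of FMA units * width of each unit.
    return core["fma_units"] * core["unit_width_bits"]

for name, core in cores.items():
    print(name, fma_throughput_bits_per_cycle(core))  # both 512 bits/cycle
```

4 x 128 and 2 x 256 both come out to 512 bits of FMA results per cycle, so per-unit width alone tells you nothing about which core is faster.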
Reading the documentation is not enough if you do not also think about what you read and assess whether it is correct.
Technical documentation is not usually written by the engineers who designed the device, so it frequently contains errors: either the technical writer did not understand what the designers said, or the writer attempted to synthesize or simplify the information in a way that changed its meaning.
It doesn't really matter if the two "halves" are issued in sequence or in parallel¹; either way they use 2 "slots" of execution which are therefore not available for other use — whether that other use be parallel issue, OOE or HT². To my knowledge, AVX512 code tends to be "concentrated"; there's generally not a lot of non-AVX512 code mixed in that would lead to a more even spread of resource use. If that were the case, the 2-slot approach would be less visible, but that's not really in the nature of SIMD code paths.
But at the same time, 8×256-bit units would be better than 4×512-bit, as the former would allow more throughput with non-AVX512 code. But that costs other resources (and would probably also limit achievable clocks, since increasing complexity generally strains timing…). 3 or 4 units seems to be what Intel & AMD engineers decided is the best tradeoff. All the more notable, then, that Zen4→Zen5 is not only a 256→512 width change but also a 3→4 unit increase³, even if the added unit is "only" a FADD one.
(I guess this is what you've been trying to argue all along. It hasn't been very clear. I'm not sure why you brought up load/store widths to begin with, and arguing "AMD didn't have a narrower datapath" isn't quite productive when the point seems to be "Intel had the same narrower datapath"?)
¹ the latency difference should be minor in the context of existing pipeline depth, but of course a latency difference exists. As you note, it seems not very easy to measure.
² HT is probably the least important there, though I'd also assume there are quite a few AVX512 workloads that can in fact load all cores and threads of a CPU.