I did a quick search on https://gcc.gnu.org/bugzilla for any issues, of any status, tagged with missed-optimization and containing the term "store-forwarding" or "bubble sort" and didn't see it (maybe I'm not searching right?)
...as opposed to? Off the top of my head I can't seem to remember a single acronym I could possibly confuse it with, and now you got me wondering which common acronym I completely failed to learn.
Yes to -fprofile-use. That brought gfortran to essentially the same geometric mean as ifort with reasonable "good" flags for me, running a benchmark set which I think is supposed to show off proprietary compilers. ifort got little benefit from its equivalent.
Not sure if that would have an impact here. GCC is just unaware of the latency implications of store forwarding. I mean, it's definitely worth a shot, but you'd just be more or less hoping that your mentioned techniques disable the right optimization pass.
That's true, although I'm curious if any compiler really understands the microarchitecture at that level without being coerced by a compiler dev writing a pass (i.e. won't happen straight from a pipeline description)
Do you know of any tools for generating this, possibly with runtime data? Been wanting to do this ever since I learned about the feature but I don't want to do this by hand for dependencies.
Edit: It is possible that I just don't understand how to actually implement PGO
To me the more interesting meta-point here is that some algorithms are less microarchitecture friendly than others.
Algorithms with larger distances between data dependencies are less likely to stall.
For example I compared an insertion sort and see the same performance between O2 and O3:
void insertionsort(int *buf)
{
for (int i = 1; i < n; ++i) {
const int x = buf[i];
int j = i - 1;
for (; j >= 0 && buf[j] > x; --j) buf[j + 1] = buf[j];
buf[j + 1] = x;
}
}
If you've spent any amount of time on StackOverflow around C/C++/assembly, you could probably tell within the first three paragraphs of the answer who authored it. :)
Lots of things are slower at higher -O levels, many things get slower if you compile them for your native instruction extensions, and there are other surprises lurking all over the compiler. The only way to be sure is to sweep out the parameter space of your compiler, using benchmark results as your objective function.
Yes, -O3 tends to include a lot of features that increase code size, like aggressive loop unrolling. If you are jumping around a large amount of code, -O3 generally performs more poorly than -O2, but if you are running a tight loop (like HPC code), -O3 is better.
In the past, at a time when I worked on a very performance sensitive codebase that was also limited in scope, we compiled with -Osize and did all the loop optimizations we wanted manually (and with pragmas). That produced faster code than -O2 or -O3.
Regarding unrolling, -O3 contains -funroll-and-jam but not -funroll-loops. You may want one or the other, maybe both, depending on circumstances. I don't see much benefit from the available pragmas on HPC-type code unless for OpenMP, and "omp simd" isn't necessary to get vectorization in the places I've seen people say it is. Mileage always varies somewhat, of course. (Before second-guessing anything, use -fopt-info.)
Modern x86 CPUs have micro instr caches to store small loops (about 50 instr) and medium loops (~2k instr). Also, the bottleneck is usually the instruction decoding (Alder Lake made huge changes on that, so this might change).
In other words, loop unrolling is, more often than not, harmful.
Probably just some tweaks to O2 would be enough, after all people are selecting Os over O2 because they see better performance, and that should not be happening.
In the application I referred to, PGO was also used. However, that only applies -Os to cold code, and if what you're doing is very branchy, it can help even in the hot path.
From my experience, -O2 includes optimizations that improve performance in the majority of cases. Whereas -O3 contains additional optimizations that may improve performance but the results are less clear-cut - hence why they're included in -O3 as opposed to -O2. I've seen cases where -O3 is faster than -O2, and vice versa.
That's why its infuriating that vectorization was not enabled at -O2. Since x86-64 supports SSE2 as a baseline, every program has been under performing at -O2 for over 15 years. Without knowing this it turns out the recent introduction of -march=x86-64-v2 and v3 dont benefit either when both have AVX2 which goes largely unused. See pathetic test results here:
https://github.com/solvespace/solvespace/issues/972
Notice that v3 is no better and may even be a hair slower.
Vectorisation is not a clear win, and so doesn’t belong in O2. Certainly it can deliver massive improvements in some cases (as loop unrolling could in Pentium III days), yet can cause significant binary size increases (think icache costs) and makes loops with small iteration counts potentially much slower.
If for some reason you want O2 but with vectorisation you can enable it, and if you want O3 without inlining or loop unrolling (the same thing), you’re welcome to specify that.
The fact that they are changing to include some vectorization options by default for -O2 suggests otherwise. In particular SLP vectorization should have been there all along. Every graphics library 2d and 3d defines vector classes that will easily go faster with the simplest vectorization.
Arguments around code size have been largely irrelevant for quite some time, so long as performance increases.
If I put on a tinfoil hat, I'd suspect Intel was at least an influence. They're the ones adding AVX, AVX2, 512 and providing their own compiler to take advantage of them, while crippling AMD parts. That is until they switched to LLVM.
Code size always matters. The code cache in the cpu is far from infinite and long jumps will stall. Sometimes vectorization can shrink code, so yay. And, like you said, if performance increases, then yay again regardless of size.
Maybe it could use a different heuristic in O2? Depending on the function it's vital to have autovectorization for performance. Neither -O3 nor -O2 are perfect as a blanket option for all the functions in a big program.
Did you mean the P4? An extremely long pipeline meant loop unrolling helps it far more than it does x86 processors before or after. PIII is more similar to Core/2/i in performance characteristics.
...and then there's Atom, which is also an odd one - I believe it's more like a classic Pentium (P54).
I think bubble sort is kind of an optimization nightmare - a loop where subsequent iterations depend on each other - and the dependency is through memory. I'm pretty sure the compiler tries to extract some level of parallelism from the loop and fails miserably.
For a decade I used to compile MySQL myself, and always -O2 and never -O3. And a few other flags. Really made a difference in speed to do yourself, but a product like that can show compiler issues and the first two years were less than stable until the right mix was discovered.
I quite frequently find -O3 to produce slower code than -O2.
This is not actually surprising: the reliable operations are placed in 2, and the dodgy ones in 3. It is possible, even common, for 3 to be faster, but you can't rely on it. You have to measure.
I've always been under the impression that -03, -0fast and related "performance" tweaks are not recommended, and shouldn't be used unless the software in question is specifically written with them in mind.
Was it ever reported?
I did a quick search on https://gcc.gnu.org/bugzilla for any issues, of any status, tagged with missed-optimization and containing the term "store-forwarding" or "bubble sort" and didn't see it (maybe I'm not searching right?)