Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
Bubble sort slower with -O3 than -O2 with GCC (stackoverflow.com)
163 points by asicsp on Oct 17, 2021 | hide | past | favorite | 54 comments


this is pretty clearly an anti-optimization you should report on https://gcc.gnu.org/bugzilla/

Was it ever reported?

I did a quick search on https://gcc.gnu.org/bugzilla for any issues, of any status, tagged with missed-optimization and containing the term "store-forwarding" or "bubble sort" and didn't see it (maybe I'm not searching right?)


How to keep the O3 goodness without pessimizations: Use likely/unlikely liberally, unreachable(), and PGO.

If the code slows down, it's usually because the compiler has generated a bunch of code because it doesn't know what hot path to schedule for.

They will generate hundreds to thousands of instructions for factorial if you let them because it turns it into a loop, then vectorizes the loop.

Undefined Behaviour pub quiz question in there ^^


Just to clarify, in this context PGO = Profile-Guided Optimization [1].

[1]: https://en.m.wikipedia.org/wiki/Profile-guided_optimization


...as opposed to? Off the top of my head I can't seem to remember a single acronym I could possibly confuse it with, and now you got me wondering which common acronym I completely failed to learn.


Pose graph optimization


That seems like a bit of a reach given that the thread is obviously about GCC.


Yes to -fprofile-use. That brought gfortran to essentially the same geometric mean as ifort with reasonable "good" flags for me, running a benchmark set which I think is supposed to show off proprietary compilers. ifort got little benefit from its equivalent.


Not sure if that would have an impact here. GCC is just unaware of the latency implications of store forwarding. I mean, it's definitely worth a shot, but you'd just be more or less hoping that your mentioned techniques disable the right optimization pass.


That's true, although I'm curious if any compiler really understands the microarchitecture at that level without being coerced by a compiler dev writing a pass (i.e. won't happen straight from a pipeline description)


Do you know of any tools for generating this, possibly with runtime data? Been wanting to do this ever since I learned about the feature but I don't want to do this by hand for dependencies.

Edit: It is possible that I just don't understand how to actually implement PGO


I don't know what exactly is required, but the basic business is

  gcc -fprofile-generate ...
  ./a.out ...
  gcc -fprofile-use ...
I don't know the current state of https://gcc.gnu.org/wiki/AutoFDO


I see something rather different with that example, using GCC 10 on SKX with that code and program argument, bound to a core, +/- ~2%:

-O2 -g: 1.07s

-O3 -g: 1.17s

-O3 -g -fprofile-generate/use: 0.84s

(maqao cqa from https://maqao.org would give comprehensive info on the generated code.)

--

   /* Tom, don't let me ever catch you using bubblesort again! -- Don */
— Knuth in dvips/afm2tfm.c


Are you also on Ryzen? I get:

clang -O3 -march=native: 0.769s

clang -O3 : 0.830s

clang -O2 -march=native : 0.904s

gcc -O2 -march=native :1.007s

gcc -Os -march=native : 1.148s

gcc -O3 -march=native : 1.928s

gcc (GCC) 11.1.0

clang version 12.0.1

On Ryzen 3700x


To me the more interesting meta-point here is that some algorithms are less microarchitecture friendly than others. Algorithms with larger distances between data dependencies are less likely to stall.

For example I compared an insertion sort and see the same performance between O2 and O3:

    void insertionsort(int *buf)
    {
      for (int i = 1; i < n; ++i) {
        const int x = buf[i];
        int j = i - 1;
        for (; j >= 0 && buf[j] > x; --j) buf[j + 1] = buf[j];
        buf[j + 1] = x;
      }
    }
bubble O2: 1.208s

bubble O3: 1.469s

insertion O2: 0.126s

insertion O3: 0.126s


If you've spent any amount of time on StackOverflow around C/C++/assembly, you could probably tell within the first three paragraphs of the answer who authored it. :)


Lots of things are slower at higher -O levels, many things get slower if you compile them for your native instruction extensions, and there are other surprises lurking all over the compiler. The only way to be sure is to sweep out the parameter space of your compiler, using benchmark results as your objective function.


Dude spent more time on the explanation than I spend on my “good” projects at work!


Mandatory xkcd https://xkcd.com/664/


I've written a lot of interpreters in C/++ lately, and spent a lot of time profiling and optimizing them.

I can't remember -O3 ever being faster than -O2, it's usually slightly slower for me.


For interpreters, that's highly likely to be true, especially if they are mainly spending time in switch tables or executing basic opcodes.


Care to explain? -O3 generates larger code than -O2?


Yes, -O3 tends to include a lot of features that increase code size, like aggressive loop unrolling. If you are jumping around a large amount of code, -O3 generally performs more poorly than -O2, but if you are running a tight loop (like HPC code), -O3 is better.

In the past, at a time when I worked on a very performance sensitive codebase that was also limited in scope, we compiled with -Osize and did all the loop optimizations we wanted manually (and with pragmas). That produced faster code than -O2 or -O3.


Regarding unrolling, -O3 contains -funroll-and-jam but not -funroll-loops. You may want one or the other, maybe both, depending on circumstances. I don't see much benefit from the available pragmas on HPC-type code unless for OpenMP, and "omp simd" isn't necessary to get vectorization in the places I've seen people say it is. Mileage always varies somewhat, of course. (Before second-guessing anything, use -fopt-info.)


Modern x86 CPUs have micro instr caches to store small loops (about 50 instr) and medium loops (~2k instr). Also, the bottleneck is usually the instruction decoding (Alder Lake made huge changes on that, so this might change).

In other words, loop unrolling is, more often than not, harmful.


It’s a shame that Osize can sometimes produce truly awful code. There are a few optimisations in there that trade a byte for a massive slowdown.


You asked for minimum size, and that's what you got. I'd say that's working as it should.

A more granular control over optimisation would be good, however.


Probably just some tweaks to O2 would be enough, after all people are selecting Os over O2 because they see better performance, and that should not be happening.


You can enable/disable individual optimizations. How much more granular do you need?


Surely a profile-guided build should be able to only apply -Os to those functions where it doesn't cause a lot of problems.


In the application I referred to, PGO was also used. However, that only applies -Os to cold code, and if what you're doing is very branchy, it can help even in the hot path.


Python, specifically, is supposed to get most benefit from -fno-semantic-interposition


What about when you factor in PGO? (-fprofile-generate & -fprofile-use)


This is the rub. the compiler is flying completely blind without PGO.

I turned on PGO for the D compiler recently and it compiled it's tests 15% faster. Boom.


Haven't really been down that rabbit hole yet, thanks for reminding me.


From my experience, -O2 includes optimizations that improve performance in the majority of cases. Whereas -O3 contains additional optimizations that may improve performance but the results are less clear-cut - hence why they're included in -O3 as opposed to -O2. I've seen cases where -O3 is faster than -O2, and vice versa.


That's why its infuriating that vectorization was not enabled at -O2. Since x86-64 supports SSE2 as a baseline, every program has been under performing at -O2 for over 15 years. Without knowing this it turns out the recent introduction of -march=x86-64-v2 and v3 dont benefit either when both have AVX2 which goes largely unused. See pathetic test results here: https://github.com/solvespace/solvespace/issues/972

Notice that v3 is no better and may even be a hair slower.


Vectorisation is not a clear win, and so doesn’t belong in O2. Certainly it can deliver massive improvements in some cases (as loop unrolling could in Pentium III days), yet can cause significant binary size increases (think icache costs) and makes loops with small iteration counts potentially much slower.

If for some reason you want O2 but with vectorisation you can enable it, and if you want O3 without inlining or loop unrolling (the same thing), you’re welcome to specify that.


The fact that they are changing to include some vectorization options by default for -O2 suggests otherwise. In particular SLP vectorization should have been there all along. Every graphics library 2d and 3d defines vector classes that will easily go faster with the simplest vectorization.

Arguments around code size have been largely irrelevant for quite some time, so long as performance increases.

If I put on a tinfoil hat, I'd suspect Intel was at least an influence. They're the ones adding AVX, AVX2, 512 and providing their own compiler to take advantage of them, while crippling AMD parts. That is until they switched to LLVM.


Code size always matters. The code cache in the cpu is far from infinite and long jumps will stall. Sometimes vectorization can shrink code, so yay. And, like you said, if performance increases, then yay again regardless of size.


Code size always matters, except for all the times when it doesn't. Or when something else matters more.

Code size matters less than it once did, now that we have bigger and smarter caches. But bigger, while sometimes big enough, is never big.


Maybe it could use a different heuristic in O2? Depending on the function it's vital to have autovectorization for performance. Neither -O3 nor -O2 are perfect as a blanket option for all the functions in a big program.


Did you mean the P4? An extremely long pipeline meant loop unrolling helps it far more than it does x86 processors before or after. PIII is more similar to Core/2/i in performance characteristics.

...and then there's Atom, which is also an odd one - I believe it's more like a classic Pentium (P54).


Well I did want to say the P4, but I never actually had one so didn’t want to say something that turned out to be wrong.

Certainly on earlier microarchitectures it did help a lot more than it does now, and there were some nice wins on the PIII for it


See https://gcc.gnu.org/gcc-12/changes.html It's still not necessarily a guaranteed win, I think.


Can you turn on optimizations on a per-function basis?


Yes, and that's sometimes a good idea. GCC has a pragma and function attribute (unfortunately not for Fortran, at least).


I think bubble sort is kind of an optimization nightmare - a loop where subsequent iterations depend on each other - and the dependency is through memory. I'm pretty sure the compiler tries to extract some level of parallelism from the loop and fails miserably.


Relevant talk by Emery Berger: Performance matters https://www.youtube.com/watch?v=r-TLSBdHe1A I'd recommend everyone interested in the subject to watch it.


Now do -Os

my results:

time ./size-optimized 30000 real 0m1,164s

time ./optimized2 30000 real 0m1,207s

time ./optimized3 30000 real 0m1,216s


For a decade I used to compile MySQL myself, and always -O2 and never -O3. And a few other flags. Really made a difference in speed to do yourself, but a product like that can show compiler issues and the first two years were less than stable until the right mix was discovered.


I quite frequently find -O3 to produce slower code than -O2.

This is not actually surprising: the reliable operations are placed in 2, and the dodgy ones in 3. It is possible, even common, for 3 to be faster, but you can't rely on it. You have to measure.


My experience with MSVC and ICC is similar - O3 is often worse than O2, and a lot of the time Os is actually the fastest (and smallest).


MSVC has /O1 and /O2 (and /Os), but not /O3. See https://docs.microsoft.com/en-us/cpp/build/reference/compile... . (I work on MSVC's STL.)


I've always been under the impression that -03, -0fast and related "performance" tweaks are not recommended, and shouldn't be used unless the software in question is specifically written with them in mind.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: