Bubble sort slower with -O3 than -O2 with GCC

rkagerer · on Oct 17, 2021

this is pretty clearly an anti-optimization you should report on https://gcc.gnu.org/bugzilla/

Was it ever reported?

I did a quick search on https://gcc.gnu.org/bugzilla for any issues, of any status, tagged with missed-optimization and containing the term "store-forwarding" or "bubble sort" and didn't see it (maybe I'm not searching right?)

mhh__ · on Oct 17, 2021

How to keep the O3 goodness without pessimizations: Use likely/unlikely liberally, unreachable(), and PGO.

If the code slows down, it's usually because the compiler has generated a bunch of code because it doesn't know what hot path to schedule for.

They will generate hundreds to thousands of instructions for factorial if you let them because it turns it into a loop, then vectorizes the loop.

Undefined Behaviour pub quiz question in there ^^

unwind · on Oct 17, 2021

Just to clarify, in this context PGO = Profile-Guided Optimization [1].

[1]: https://en.m.wikipedia.org/wiki/Profile-guided_optimization

jhgb · on Oct 17, 2021

...as opposed to? Off the top of my head I can't seem to remember a single acronym I could possibly confuse it with, and now you got me wondering which common acronym I completely failed to learn.

toxik · on Oct 17, 2021

Pose graph optimization

mhh__ · on Oct 17, 2021

That seems like a bit of a reach given that the thread is obviously about GCC.

gnufx · on Oct 17, 2021

Yes to -fprofile-use. That brought gfortran to essentially the same geometric mean as ifort with reasonable "good" flags for me, running a benchmark set which I think is supposed to show off proprietary compilers. ifort got little benefit from its equivalent.

MauranKilom · on Oct 17, 2021

Not sure if that would have an impact here. GCC is just unaware of the latency implications of store forwarding. I mean, it's definitely worth a shot, but you'd just be more or less hoping that your mentioned techniques disable the right optimization pass.

mhh__ · on Oct 18, 2021

That's true, although I'm curious if any compiler really understands the microarchitecture at that level without being coerced by a compiler dev writing a pass (i.e. won't happen straight from a pipeline description)

bick_nyers · on Oct 18, 2021

Do you know of any tools for generating this, possibly with runtime data? Been wanting to do this ever since I learned about the feature but I don't want to do this by hand for dependencies.

Edit: It is possible that I just don't understand how to actually implement PGO

gnufx · on Oct 18, 2021

I don't know what exactly is required, but the basic business is

  gcc -fprofile-generate ...
  ./a.out ...
  gcc -fprofile-use ...

I don't know the current state of https://gcc.gnu.org/wiki/AutoFDO

gnufx · on Oct 17, 2021

I see something rather different with that example, using GCC 10 on SKX with that code and program argument, bound to a core, +/- ~2%:

-O2 -g: 1.07s

-O3 -g: 1.17s

-O3 -g -fprofile-generate/use: 0.84s

(maqao cqa from https://maqao.org would give comprehensive info on the generated code.)

--

   /* Tom, don't let me ever catch you using bubblesort again! -- Don */

— Knuth in dvips/afm2tfm.c

Ballas · on Oct 19, 2021

Are you also on Ryzen? I get:

clang -O3 -march=native: 0.769s

clang -O3 : 0.830s

clang -O2 -march=native : 0.904s

gcc -O2 -march=native :1.007s

gcc -Os -march=native : 1.148s

gcc -O3 -march=native : 1.928s

gcc (GCC) 11.1.0

clang version 12.0.1

On Ryzen 3700x

mgraczyk · on Oct 17, 2021

To me the more interesting meta-point here is that some algorithms are less microarchitecture friendly than others. Algorithms with larger distances between data dependencies are less likely to stall.

For example I compared an insertion sort and see the same performance between O2 and O3:

    void insertionsort(int *buf)
    {
      for (int i = 1; i < n; ++i) {
        const int x = buf[i];
        int j = i - 1;
        for (; j >= 0 && buf[j] > x; --j) buf[j + 1] = buf[j];
        buf[j + 1] = x;
      }
    }

bubble O2: 1.208s

bubble O3: 1.469s

insertion O2: 0.126s

insertion O3: 0.126s

MauranKilom · on Oct 17, 2021

If you've spent any amount of time on StackOverflow around C/C++/assembly, you could probably tell within the first three paragraphs of the answer who authored it. :)

jeffbee · on Oct 17, 2021

Lots of things are slower at higher -O levels, many things get slower if you compile them for your native instruction extensions, and there are other surprises lurking all over the compiler. The only way to be sure is to sweep out the parameter space of your compiler, using benchmark results as your objective function.

flatiron · on Oct 17, 2021

Dude spent more time on the explanation than I spend on my “good” projects at work!

mcraiha · on Oct 17, 2021

Mandatory xkcd https://xkcd.com/664/

codr7 · on Oct 17, 2021

I've written a lot of interpreters in C/++ lately, and spent a lot of time profiling and optimizing them.

I can't remember -O3 ever being faster than -O2, it's usually slightly slower for me.

DannyBee · on Oct 17, 2021

For interpreters, that's highly likely to be true, especially if they are mainly spending time in switch tables or executing basic opcodes.

infberg · on Oct 17, 2021

Care to explain? -O3 generates larger code than -O2?

pclmulqdq · on Oct 17, 2021

Yes, -O3 tends to include a lot of features that increase code size, like aggressive loop unrolling. If you are jumping around a large amount of code, -O3 generally performs more poorly than -O2, but if you are running a tight loop (like HPC code), -O3 is better.

In the past, at a time when I worked on a very performance sensitive codebase that was also limited in scope, we compiled with -Osize and did all the loop optimizations we wanted manually (and with pragmas). That produced faster code than -O2 or -O3.

gnufx · on Oct 17, 2021

Regarding unrolling, -O3 contains -funroll-and-jam but not -funroll-loops. You may want one or the other, maybe both, depending on circumstances. I don't see much benefit from the available pragmas on HPC-type code unless for OpenMP, and "omp simd" isn't necessary to get vectorization in the places I've seen people say it is. Mileage always varies somewhat, of course. (Before second-guessing anything, use -fopt-info.)

boibombeiro · on Oct 18, 2021

Modern x86 CPUs have micro instr caches to store small loops (about 50 instr) and medium loops (~2k instr). Also, the bottleneck is usually the instruction decoding (Alder Lake made huge changes on that, so this might change).

In other words, loop unrolling is, more often than not, harmful.

jleahy · on Oct 17, 2021

It’s a shame that Osize can sometimes produce truly awful code. There are a few optimisations in there that trade a byte for a massive slowdown.

userbinator · on Oct 17, 2021

You asked for minimum size, and that's what you got. I'd say that's working as it should.

A more granular control over optimisation would be good, however.

jleahy · on Oct 18, 2021

Probably just some tweaks to O2 would be enough, after all people are selecting Os over O2 because they see better performance, and that should not be happening.

kevin_thibedeau · on Oct 17, 2021

You can enable/disable individual optimizations. How much more granular do you need?

jhgb · on Oct 17, 2021

Surely a profile-guided build should be able to only apply -Os to those functions where it doesn't cause a lot of problems.

pclmulqdq · on Oct 17, 2021

In the application I referred to, PGO was also used. However, that only applies -Os to cold code, and if what you're doing is very branchy, it can help even in the hot path.

gnufx · on Oct 17, 2021

Python, specifically, is supposed to get most benefit from -fno-semantic-interposition

nly · on Oct 17, 2021

What about when you factor in PGO? (-fprofile-generate & -fprofile-use)

mhh__ · on Oct 17, 2021

This is the rub. the compiler is flying completely blind without PGO.

I turned on PGO for the D compiler recently and it compiled it's tests 15% faster. Boom.

codr7 · on Oct 17, 2021

Haven't really been down that rabbit hole yet, thanks for reminding me.

_fjb4 · on Oct 17, 2021

From my experience, -O2 includes optimizations that improve performance in the majority of cases. Whereas -O3 contains additional optimizations that may improve performance but the results are less clear-cut - hence why they're included in -O3 as opposed to -O2. I've seen cases where -O3 is faster than -O2, and vice versa.

phkahler · on Oct 17, 2021

That's why its infuriating that vectorization was not enabled at -O2. Since x86-64 supports SSE2 as a baseline, every program has been under performing at -O2 for over 15 years. Without knowing this it turns out the recent introduction of -march=x86-64-v2 and v3 dont benefit either when both have AVX2 which goes largely unused. See pathetic test results here: https://github.com/solvespace/solvespace/issues/972

Notice that v3 is no better and may even be a hair slower.

jleahy · on Oct 17, 2021

Vectorisation is not a clear win, and so doesn’t belong in O2. Certainly it can deliver massive improvements in some cases (as loop unrolling could in Pentium III days), yet can cause significant binary size increases (think icache costs) and makes loops with small iteration counts potentially much slower.

If for some reason you want O2 but with vectorisation you can enable it, and if you want O3 without inlining or loop unrolling (the same thing), you’re welcome to specify that.

phkahler · on Oct 17, 2021

The fact that they are changing to include some vectorization options by default for -O2 suggests otherwise. In particular SLP vectorization should have been there all along. Every graphics library 2d and 3d defines vector classes that will easily go faster with the simplest vectorization.

Arguments around code size have been largely irrelevant for quite some time, so long as performance increases.

If I put on a tinfoil hat, I'd suspect Intel was at least an influence. They're the ones adding AVX, AVX2, 512 and providing their own compiler to take advantage of them, while crippling AMD parts. That is until they switched to LLVM.

sroussey · on Oct 17, 2021

Code size always matters. The code cache in the cpu is far from infinite and long jumps will stall. Sometimes vectorization can shrink code, so yay. And, like you said, if performance increases, then yay again regardless of size.

ncmncm · on Oct 18, 2021

Code size always matters, except for all the times when it doesn't. Or when something else matters more.

Code size matters less than it once did, now that we have bigger and smarter caches. But bigger, while sometimes big enough, is never big.

kzrdude · on Oct 17, 2021

Maybe it could use a different heuristic in O2? Depending on the function it's vital to have autovectorization for performance. Neither -O3 nor -O2 are perfect as a blanket option for all the functions in a big program.

userbinator · on Oct 17, 2021

Did you mean the P4? An extremely long pipeline meant loop unrolling helps it far more than it does x86 processors before or after. PIII is more similar to Core/2/i in performance characteristics.

...and then there's Atom, which is also an odd one - I believe it's more like a classic Pentium (P54).

jleahy · on Oct 17, 2021

Well I did want to say the P4, but I never actually had one so didn’t want to say something that turned out to be wrong.

Certainly on earlier microarchitectures it did help a lot more than it does now, and there were some nice wins on the PIII for it

gnufx · on Oct 17, 2021

See https://gcc.gnu.org/gcc-12/changes.html It's still not necessarily a guaranteed win, I think.

amelius · on Oct 17, 2021

Can you turn on optimizations on a per-function basis?

gnufx · on Oct 17, 2021

Yes, and that's sometimes a good idea. GCC has a pragma and function attribute (unfortunately not for Fortran, at least).

torginus · on Oct 17, 2021

I think bubble sort is kind of an optimization nightmare - a loop where subsequent iterations depend on each other - and the dependency is through memory. I'm pretty sure the compiler tries to extract some level of parallelism from the loop and fails miserably.

glandium · on Oct 17, 2021

Relevant talk by Emery Berger: Performance matters https://www.youtube.com/watch?v=r-TLSBdHe1A I'd recommend everyone interested in the subject to watch it.

c21h30o2 · on Oct 18, 2021

Now do -Os

my results:

time ./size-optimized 30000 real 0m1,164s

time ./optimized2 30000 real 0m1,207s

time ./optimized3 30000 real 0m1,216s

sroussey · on Oct 17, 2021

For a decade I used to compile MySQL myself, and always -O2 and never -O3. And a few other flags. Really made a difference in speed to do yourself, but a product like that can show compiler issues and the first two years were less than stable until the right mix was discovered.

ncmncm · on Oct 18, 2021

I quite frequently find -O3 to produce slower code than -O2.

This is not actually surprising: the reliable operations are placed in 2, and the dodgy ones in 3. It is possible, even common, for 3 to be faster, but you can't rely on it. You have to measure.

userbinator · on Oct 17, 2021

My experience with MSVC and ICC is similar - O3 is often worse than O2, and a lot of the time Os is actually the fastest (and smallest).

StephanTLavavej · on Oct 17, 2021

MSVC has /O1 and /O2 (and /Os), but not /O3. See https://docs.microsoft.com/en-us/cpp/build/reference/compile... . (I work on MSVC's STL.)

kiryin · on Oct 17, 2021

I've always been under the impression that -03, -0fast and related "performance" tweaks are not recommended, and shouldn't be used unless the software in question is specifically written with them in mind.