So for any process that's using less than 16GB, it's a significant performance b...

jeffbee · 2026-03-16T22:41:52 1773700912

That strikes me as a common hugepages win. People never believe you, though, when you say you can make their thing 20% faster for free.

menaerus · 2026-03-17T13:51:07 1773755467

Then it should be pretty easy to display that 20% "faster for free", no? But as always the devil is in the details. I experimented a lot with huge pages, and although in theory you should see the performance boost, the workloads I have been using to test this hypothesis did not end up with anything statistically significant/measurable. So, my conclusion was ... it depends.
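
One thing worth ruling out before concluding "it depends" is whether transparent huge pages were active for the test runs at all. A minimal sketch, assuming a Linux system that exposes the usual sysfs file, where the active mode is the bracketed word (for example `always [madvise] never`). The helper names `parse_thp_mode` and `current_thp_mode` are made up for illustration:

```python
from pathlib import Path
from typing import Optional

# Standard Linux sysfs location for the THP policy; absent on other systems.
THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def parse_thp_mode(text: str) -> str:
    """Return the bracketed (active) mode from a line like 'always [madvise] never'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError(f"no bracketed mode found in {text!r}")


def current_thp_mode(path: Path = THP_ENABLED) -> Optional[str]:
    """Read the active THP mode, or None if the file is missing or unreadable."""
    try:
        return parse_thp_mode(path.read_text())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    print("THP mode:", current_thp_mode())
```

If the mode is `madvise`, only regions that explicitly opt in get huge pages, so a benchmark that never opts in would measure no difference. The `AnonHugePages` line in `/proc/meminfo` is another quick check that anonymous memory is actually backed by huge pages during the run.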

Dylan16807 · 2026-03-17T18:18:54 1773771534

Try a big Factorio map just as a test case. It's a bit of an outlier on performance; in particular, it's very heavy on memory bandwidth.

jeffbee · 2026-03-17T15:53:28 1773762808

Of course, it only helps workloads that exhibit high rates of page table walking per instruction. But those are really common.

menaerus · 2026-03-17T19:04:32 1773774272

Yes, I understand that. It is implied that there's a high TLB miss rate. However, I'm wondering whether the penalty, which we can quantify as up to 4 memory accesses for a 4-level page table walk (~20 cycles if the page table entries are already in L1 cache, or ~60-200 cycles if they are in L2/L3), would be noticeable in workloads that are IO bound. In other words, would such workloads benefit from switching to huge pages when the CPU spends most of its time waiting for data to arrive from storage anyway?
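
To put rough numbers on that question, here is a back-of-the-envelope sketch. Everything in it is an assumption for illustration: the per-access latency, the TLB miss rate, the base CPI, and the fraction of wall time spent on the CPU are made-up inputs, and the model optimistically treats huge pages as removing every page walk, so the result is an upper bound rather than a measurement.

```python
def walk_cycles(levels: int, cycles_per_access: float) -> float:
    """Cost of one page table walk: one memory access per level."""
    return levels * cycles_per_access


def walk_share(misses_per_kilo_instr: float, walk_cost: float,
               base_cpi: float = 1.0) -> float:
    """Fraction of CPU cycles spent walking page tables.

    misses_per_kilo_instr: TLB misses per 1000 instructions (assumed input).
    base_cpi: cycles per instruction with no walks at all (assumed input).
    """
    walk_cpi = misses_per_kilo_instr / 1000.0 * walk_cost
    return walk_cpi / (base_cpi + walk_cpi)


def wall_time_saving(cpu_fraction: float, share: float) -> float:
    """Upper bound on wall-clock saving if every walk disappeared.

    cpu_fraction: share of wall time the process is actually on the CPU;
    the rest is assumed to be spent waiting on IO.
    """
    return cpu_fraction * share


if __name__ == "__main__":
    cost = walk_cycles(levels=4, cycles_per_access=5)   # entries hit in L1
    share = walk_share(misses_per_kilo_instr=10, walk_cost=cost)
    for cpu_fraction in (1.0, 0.5, 0.1):
        saving = wall_time_saving(cpu_fraction, share)
        print(f"on-CPU {cpu_fraction:.0%}: at most {saving:.1%} wall time saved")
```

Under these assumptions the on-CPU time shrinks by the same proportion regardless of IO, but the wall-clock saving scales down with the fraction of time the process is on the CPU at all, which is why a mostly IO-bound run may show nothing measurable.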

jeffbee · 2026-03-17T19:37:37 1773776257

In a multi-tenant environment, yes. The faster they can get off the CPU and yield to some other tenant, the better it is.