So for any process that's using less than 16GB, it's a significant performance boost. And most processes using more RAM, but not splitting accesses across more than 16 zones in rapid succession, will also see a performance boost.
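To put rough numbers on the 16GB figure: TLB reach is just the number of entries times the page size. A minimal sketch, where `tlb_reach_bytes` is a made-up helper and the 4 KiB entry count is an illustrative assumption, not a figure for any particular CPU:

```python
KIB = 2**10
GIB = 2**30

def tlb_reach_bytes(entries: int, page_size: int) -> int:
    """Memory that can be mapped without a TLB miss: entries * page size."""
    return entries * page_size

# 16 slots for 1 GiB pages, as in the comment above.
reach_16 = tlb_reach_bytes(16, GIB)

# 4 slots for 1 GiB pages, as on the older Intel CPU mentioned in this thread.
reach_4 = tlb_reach_bytes(4, GIB)

# Assumed entry count for 4 KiB pages, for comparison only.
ASSUMED_4K_ENTRIES = 1536
reach_4k = tlb_reach_bytes(ASSUMED_4K_ENTRIES, 4 * KIB)

print(reach_16 // GIB, "GiB reach with 16 x 1 GiB entries")
print(reach_4 // GIB, "GiB reach with 4 x 1 GiB entries")
print(reach_4k // KIB, "KiB reach with the assumed 4 KiB entries")
```

A working set that fits inside the reach should rarely miss the TLB. Anything larger depends on how the accesses are spread out.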
My old Intel CPU only has 4 slots for 1GB pages, and that was enough to get me about a 20% performance boost on Factorio. (I think a couple of percent might have come from the allocator change, but the boost from forcing huge pages was very significant.)
Then it should be pretty easy to demonstrate that 20% "faster for free", no? But as always, the devil is in the details. I have experimented a lot with huge pages, and although in theory you should see a performance boost, the workloads I used to test this hypothesis did not show anything statistically significant or even measurable. So my conclusion was ... it depends.
Yes, I understand that. It is implied that there's a high TLB miss rate. However, I'm wondering whether the penalty, which we can quantify as up to 4 memory accesses for a 4-level page table walk (~20 cycles if the page table entries are already in L1 cache, or ~60-200 cycles if they are in L2/L3), would be noticeable in workloads that are IO bound. In other words, would such workloads benefit from switching to huge pages when the CPU spends most of its time waiting for data to arrive from storage anyway?
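The back-of-the-envelope arithmetic behind that question can be sketched as below. The per-access latencies (5, 15 and 50 cycles) are assumptions picked so that they reproduce the ~20 and ~60-200 cycle figures above. The helper names and the 10% / 20% example inputs are hypothetical, not measurements:

```python
def walk_penalty_cycles(levels: int, cycles_per_access: float) -> float:
    """Cost of one page table walk: one memory access per level."""
    return levels * cycles_per_access

def max_wall_clock_saving(cpu_fraction: float, walk_share: float) -> float:
    """Upper bound on the wall-clock time saved if every page walk vanished.

    cpu_fraction: share of wall-clock time the CPU is actually executing
                  (the rest is spent waiting on IO).
    walk_share:   share of that CPU time spent in page table walks.
    """
    return cpu_fraction * walk_share

# 4-level walk with assumed per-access latencies.
l1_walk = walk_penalty_cycles(4, 5)        # entries in L1
l2_walk = walk_penalty_cycles(4, 15)       # entries in L2
l3_walk = walk_penalty_cycles(4, 50)       # entries in L3

# Hypothetical IO-bound workload: CPU busy 10% of the time,
# 20% of that busy time spent walking page tables.
saving = max_wall_clock_saving(cpu_fraction=0.10, walk_share=0.20)

print(l1_walk, l2_walk, l3_walk)
print(f"at most {saving:.1%} of wall-clock time")
```

Under those made-up inputs the ceiling is about 2% of wall-clock time, which would be easy to lose in measurement noise. A CPU-bound workload with the same walk share would have a ceiling ten times higher.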