The prefetch always fetches the same amount of data (typically one cache line), no matter how large the Python object in question is. That has two consequences. 1) You can still miss if the Python object is larger than a cache line. 2) You miss if the Python object is not aligned to a cache line boundary and the comparison touches the part that lies in the next cache line.
If these two conditions apply, the speedup is limited to a factor of 2, and increasing the Python object size reduces the effect of the prefetch, although it still helps.
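To make the two miss cases concrete, here is a rough sketch of the arithmetic. The 64-byte line size and the helper names `lines_spanned` and `prefetch_covers` are my own assumptions for illustration; they are not taken from the patch under discussion.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size; 64 bytes is common on x86-64 but not universal. */
#define CACHE_LINE 64u

/* Number of cache lines touched by an object of `size` bytes at address `addr`. */
size_t lines_spanned(uintptr_t addr, size_t size)
{
    if (size == 0)
        return 0;
    uintptr_t first = addr / CACHE_LINE;
    uintptr_t last = (addr + size - 1) / CACHE_LINE;
    return (size_t)(last - first + 1);
}

/* Returns 1 if a one-line prefetch of `addr` also brings in the byte at
 * `addr + offset`, i.e. both addresses fall in the same cache line. */
int prefetch_covers(uintptr_t addr, size_t offset)
{
    return addr / CACHE_LINE == (addr + offset) / CACHE_LINE;
}
```

For example, a 32-byte object starting at offset 48 within a line spans two lines, so a field at offset 16 inside it is not covered by a prefetch of the object's start address, while a field at offset 8 is.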
Another point, about the prefetch distance: I'd pick 5 based on his testing data.
1) It has no downside on current tests.
2) Because the prefetch size is constant, any prefetch that hurts because you picked a distance of 5 instead of 3 is unlikely to have a meaningful effect at any prefetch distance.
3) It leaves room for some improvement in comparison speed, for example if there were a JIT on the Python side to specialize the test.
4) It leaves room for a nice factor of improvement in core execution speed relative to memory latency.
5) It lets you handle page misses better.
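For readers unfamiliar with the term, "prefetch distance" here means how many elements ahead of the current one the prefetch is issued. Below is a minimal sketch of the idea, not the actual patch: `item_t`, `count_matches`, and `PREFETCH_DISTANCE` are hypothetical names, and it assumes GCC or Clang for `__builtin_prefetch`.

```c
#include <stddef.h>

/* Hypothetical element type: the array holds pointers to separately
 * allocated objects, loosely like a list of object pointers. */
typedef struct {
    long key;
} item_t;

/* Assumed prefetch distance, matching the value 5 suggested above. */
#define PREFETCH_DISTANCE 5

/* Count items whose key equals `target`, prefetching the object that
 * will be compared PREFETCH_DISTANCE iterations from now. */
size_t count_matches(item_t *const *items, size_t n, long target)
{
    size_t hits = 0;
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__) || defined(__clang__)
        if (i + PREFETCH_DISTANCE < n) {
            /* Arguments: address, 0 = read access, 1 = low temporal locality.
             * Only the cache line containing the address is requested. */
            __builtin_prefetch(items[i + PREFETCH_DISTANCE], 0, 1);
        }
#endif
        if (items[i]->key == target)
            hits++;
    }
    return hits;
}
```

A larger distance gives the memory system more time to deliver the line before the comparison needs it, which is why it matters more as the per-element work gets faster relative to memory latency.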
The prefetch may take the same amount of time, but the number of objects that it pulls into cache depends on their size and allocation pattern. If they are large, or were allocated over time (that is, from different memory blocks), fewer will be pulled in. It seems to me that this would make prefetch less effective. That is, if the thing you are trying to prefetch was already pulled in because it was close enough to the last one, the prefetch is a no-op.