
I don't think you can make a blanket statement here; it's going to really depend on the implementation details.

For example, if each plugin includes any kind of LUT, you don't have data locality either way, and you're much better off passing data between the plugins. If the plugins are complex, you'll be flushing your instruction cache, which will have to be refilled via random access as opposed to the linear reading of an audio segment.

Further, 192 kHz 24-bit audio is only about 0.6 megabytes per second. Skylake's listed sustained L3 bandwidth is 18 bytes/cycle. That's enough to transfer roughly 90,000 such audio streams simultaneously. It's very unlikely this is a bottleneck.
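To spell out that arithmetic (the ~3 GHz clock is my assumption; only the 18 bytes/cycle figure comes from the spec):

```cpp
// Back-of-envelope: how many mono 192 kHz / 24-bit streams the quoted
// sustained L3 bandwidth could feed. Clock speed is an assumed ~3 GHz.
constexpr double kStreamBytesPerSec = 192000.0 * 3.0;  // 576,000 B/s, ~0.6 MB/s
constexpr double kL3BytesPerSec     = 18.0 * 3.0e9;    // 18 B/cycle * 3 GHz = 54 GB/s
constexpr double kStreamsSupported  = kL3BytesPerSec / kStreamBytesPerSec;  // ~93,750
```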



There are a lot of assumptions and some misunderstanding here. Data locality is about latency first and foremost. DDR3 at its slowest actually has 30 GB/s of bandwidth, and DDR4 can get past 70 GB/s. Memory bandwidth is rarely the issue.

Also, instructions shouldn't be huge, but more importantly they don't change. If the audio buffer stays on the same CPU, it stays in that core's cache too.

Don't forget that writing takes time too. Writing can be a big bottleneck. Keep the data local to the same CPU and it doesn't have to go out to main memory yet.

The other things you're saying, about 'flushing' the instruction cache, L3 bandwidth numbers, and a theoretical LUT that makes a difference in one scenario and not the other, are all asserted without measuring (even though the whole scenario is made up) and just seem like stabs in the dark to argue about vague what-ifs.


Skylake-X L3 latency is ~20ns. So if you build an SPSC queue between them, how many plugins are we chaining up linearly before this becomes an issue, or even a factor? Even 1,000 hops is only ~20µs, nowhere near 1ms.

OK, so we're left with a single core running a thousand plugins, and instruction cache pressure is a 'stab in the dark to argue about vague what-ifs'?

You take an absolutist view on what is so obviously a complicated trade off and talk down to me to boot. Maybe I know about high performance code, maybe I don't, maybe you do, maybe you don't. But I do know enough about talking to people on the internet to know to nip this conversation in the bud.


> Skylake-X L3 latency is ~20ns. So if you build an SPSC queue between them

The latency is mostly about initial cache misses. There is no reason to take the time to write out a buffer of samples to memory, only to have another CPU access them with a cache miss. One of many things you are missing here is prefetching: instructions will be heavily prefetched, as will samples when accessed in any sort of linear fashion.

Also, you can't explicitly use caches or send data between them; that is up to the CPU, and it will use the whole cache hierarchy.

> You take an absolutist view

Everything dealing with performance needs to be measured, but I have a good idea of how things work so I know what to prioritize and try first. Architecture is really the key to these things and in my replies I've illustrated why.

> Maybe I know about high performance code, maybe I don't

It sounds like you have read enough, but haven't necessarily gone through lots of optimizations and reconciled what you know with the results of profiling. Understanding modern CPUs is good for understanding why results happen, but less so for estimating exactly what the results will be when going in blind.

> maybe you do, maybe you don't

I've got a decent handle on it at this point.


If you were as good as you claim, you would have directly answered my argument instead of hitting a strawman for five paragraphs.

Your experience led to overconfidence and you identified a ridiculous bottleneck for the problem domain. This is complicated and FPU heavy code running on few pieces of tiny data. And yes, riddled with LUTs. The latency cost you're worried about is in the noise.

Instead of doing some back of the envelope calculations and realizing your mistake, you double down, handwave and smugly attack me.

Your conclusions are bullshit, as is your evaluation of my experience. For anyone else that happens to be reading, I suggest taking a look through the source of a few plugins and judging for yourself.

https://vcvrack.com/plugins.html


There is no need to be upset; there is no real finality here, and everything has to be measured.

That being said, the LUTs would follow the same pattern as the code: all threads would use them, and if they are part of the executable they don't change. This, combined with prefetching and out-of-order execution, means their latency is likely to be hidden by the cache.

New data coming through, however, would be transformed, creating more new data. While the instructions and LUTs aren't changing, the new data created by each transformation can be kept local, so it doesn't incur the write-back penalties and cache misses that come from allocating new memory, writing to it, and eventually getting it to another CPU.

If the same CPU is working on the same memory buffer, there is no need to allocate a new one for every filter or to manage lifetimes and ownership of various buffers.
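To illustrate (stage names are made up, not from any real plugin): the whole chain runs on one core over one buffer, in place, so samples stay hot in that core's cache and no per-stage buffers are allocated or handed off.

```cpp
#include <cmath>
#include <vector>

using Stage = void (*)(std::vector<float>&);

// Two toy stages; real plugins would be far heavier.
void gain(std::vector<float>& buf)      { for (float& s : buf) s *= 0.5f; }
void soft_clip(std::vector<float>& buf) { for (float& s : buf) s = std::tanh(s); }

// Every stage transforms the same buffer in place on the same core:
// no copies, no cross-core handoffs, no buffer ownership to track.
void run_chain(std::vector<float>& buf, const std::vector<Stage>& chain) {
    for (Stage st : chain) st(buf);
}
```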


If you took time to read the code linked, you'd notice two things:

1) It's very common for the processing of samples to not be independent, but have iterative state; for example delay effects, amplifiers, noise gates...

2) The work done per sample is substantial, with nested loops, trig functions, and hard-to-vectorize patterns.

So not only does your technique break the model of the problem domain, the L3 latency you're so worried about when retrieving a block of samples is comparable to a single call to sin, which in some cases we're doing multiple times per sample.
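As a concrete (entirely made-up) illustration of both points, here's a toy feedback delay: each output depends on previous outputs, so samples can't be processed independently, and there's a trig call per sample:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical effect, not taken from any linked plugin. The delay line
// is iterative state: each output feeds back into later outputs.
struct FeedbackDelay {
    std::vector<float> line;   // delay line (carried state)
    std::size_t pos = 0;
    float feedback = 0.5f;

    explicit FeedbackDelay(std::size_t len) : line(len, 0.0f) {}

    float process(float in) {
        float delayed = line[pos];
        float out = in + feedback * delayed;   // depends on a past output
        line[pos] = out;
        pos = (pos + 1) % line.size();
        return std::sin(out);                  // trig work on every sample
    }
};
```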

Now you conflate passing data between threads with memory allocation, as though SPSC ring buffers aren't a trivial building block. This is after lecturing me on my many "misunderstandings"... if you're willing to assume I'm advocating malloc in the critical path (!?), no wonder you're finding so many.
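For anyone following along, this is roughly what I mean by an SPSC ring buffer (a sketch; single producer and single consumer assumed, capacity a power of two): a fixed array plus two atomic indices. Pushing a sample copies it into preallocated storage; nothing on the hot path allocates.

```cpp
#include <array>
#include <atomic>
#include <cstddef>

template <typename T, std::size_t N>   // N must be a power of two
struct SpscRing {
    std::array<T, N> buf{};            // all storage preallocated up front
    std::atomic<std::size_t> head{0};  // advanced only by the producer
    std::atomic<std::size_t> tail{0};  // advanced only by the consumer

    bool push(const T& v) {            // producer thread only
        std::size_t h = head.load(std::memory_order_relaxed);
        if (h - tail.load(std::memory_order_acquire) == N) return false;  // full
        buf[h & (N - 1)] = v;          // copy into a preallocated slot
        head.store(h + 1, std::memory_order_release);
        return true;
    }
    bool pop(T& out) {                 // consumer thread only
        std::size_t t = tail.load(std::memory_order_relaxed);
        if (head.load(std::memory_order_acquire) == t) return false;      // empty
        out = buf[t & (N - 1)];
        tail.store(t + 1, std::memory_order_release);
        return true;
    }
};
```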

I'm not upset, I'm just being blunt. Ditch the cockiness, or at least reserve it for when your arguments are bulletproof.


> L3 latency you're so worried about

I'm not sure where this is coming from. If one CPU is generating new data and another CPU is picking it up, it's wasting locality. If lots of new data is generated it might get to other CPUs through shared cache or memory, but either way it isn't necessary.

Data accessed linearly is prefetched and latency is eventually hidden. This, combined with the fact that instructions aren't changing and are usually tiny in comparison, is why instruction locality is not the primary problem to solve.

The difference it makes is up to measurement, but trying to pin one filter per core is a simplistic and naive answer. It implies that concurrency depends on how many different transformations exist, when in reality the number of cores that can be utilized comes down to the number of groups of data that can be processed without dependencies.

> SPSC ring buffers

That's a form of memory allocation. When you fabricate something to argue against, that's called a straw man fallacy.


Are you claiming that the act of writing bytes into a ring buffer counts as memory allocation? In that case I misunderstood what you were saying, and it was indeed a straw man.

In any case, we're clearly not going to find common ground here.



