- I imagine the extra memory bandwidth of newer parts doesn't hurt. The example traces were taken on server-class Ice Lake machines. They just don't overflow for our typical workloads.
- We found the specific IPT configuration matters a lot. Turning off return compression makes overflows more likely. magic-trace lets you vary this via the `-timing-resolution` parameter; there's more detail in the wiki. We don't typically see overflows under the default configuration, even on Broadwell server-class parts.
- Clark spent a week on an Intel NUC (mobile Tiger Lake part) toiling away on decode error recovery. For the most part, the data lost are uninteresting branches, and only one of a frame's call-in or return-out needs to survive the decode error for us to be able to construct a frame for it.
We also considered periodic stack sampling for error recovery, but didn't implement it since the decode error recovery we have turned out to be robust enough in practice.
We ended up having more trouble with runtimes that mess with the stack pointer directly. (The kernel does this for the retpoline Spectre mitigation! But perf is smart and rewrites that part of the instruction stream into a jump for us.) There's code in magic-trace to special-case OCaml exceptions, for instance, and it's likely similar code is necessary for some other runtimes too (we have an open issue for Go's coroutine switching).
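To make the "only one of the call in / return out needs to survive" point concrete, here is a rough sketch of how frames could be rebuilt from a lossy call/return stream. It is not magic-trace's actual algorithm. The event layout, the `rebuild_frames` name, and the choice to start orphaned frames at the trace start are all assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical, simplified event stream: real decoders carry much more. */
enum ev_kind { EV_CALL, EV_RET };
struct event { enum ev_kind kind; int sym; long ts; };
struct frame { int sym; long start; long end; };

#define MAX_DEPTH 64

/* Append a finished frame if there is room in the output buffer. */
static void emit(struct frame *out, size_t cap, size_t *n, struct frame f)
{
    if (*n < cap)
        out[(*n)++] = f;
}

/* Rebuild frames from events that may be missing calls or returns.
 * Returns the number of frames written to `out`. */
size_t rebuild_frames(const struct event *ev, size_t n_ev,
                      long trace_start, long trace_end,
                      struct frame *out, size_t cap)
{
    struct frame stack[MAX_DEPTH];
    size_t depth = 0, n_out = 0;

    for (size_t i = 0; i < n_ev; i++) {
        if (ev[i].kind == EV_CALL) {
            if (depth < MAX_DEPTH)
                stack[depth++] = (struct frame){ ev[i].sym, ev[i].ts, -1 };
            continue;
        }
        /* EV_RET: look for the innermost open frame with this symbol. */
        size_t j = depth;
        while (j > 0 && stack[j - 1].sym != ev[i].sym)
            j--;
        if (j == 0) {
            /* The call was lost: the return alone still yields a frame.
             * Starting it at trace_start is a crude assumption. */
            emit(out, cap, &n_out,
                 (struct frame){ ev[i].sym, trace_start, ev[i].ts });
        } else {
            /* Frames above the match lost their returns: close them here. */
            while (depth >= j) {
                struct frame f = stack[--depth];
                f.end = ev[i].ts;
                emit(out, cap, &n_out, f);
            }
        }
    }
    /* Calls whose returns never showed up end with the trace. */
    while (depth > 0) {
        struct frame f = stack[--depth];
        f.end = trace_end;
        emit(out, cap, &n_out, f);
    }
    return n_out;
}
```

A return with no matching call produces a frame that starts at the trace start, and a call with no matching return produces one that ends at the next surviving outer return or at the trace end. A frame is only lost entirely when both of its events are dropped.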
DDIO operates mostly transparently to software, with the I/O controller feeding DMAs into a slice of L3. Hardware can opt out by setting PCIe TLP header hints, and you have some system-wide configurability via MSRs, but it's not something a userspace application can take into its own hands.
Absolutely! This is one of the main features of magic-trace, and in fact a primary use-case.
You can select a trigger symbol, and magic-trace will take a snapshot the next time that symbol is called. This can be whatever you want, and you can imagine writing code like
if (something_really_wonky_happened) { take_magic_trace(); }
and asking magic-trace to take a snapshot of the past only when `take_magic_trace` is called.
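A slightly fuller sketch of that pattern in C, assuming GCC or Clang. Only `take_magic_trace` and the `something_really_wonky_happened` condition come from the snippet above; `handle_request`, the latency budget, and the `trigger_calls` counter are hypothetical names for illustration. The important detail is that the trigger has to remain a real, separately callable symbol, so it is marked `noinline` and given a side effect the compiler cannot discard.

```c
#include <stdbool.h>

/* Hypothetical counter: gives the trigger a visible side effect so the
 * compiler cannot drop the call, and lets us observe that it fired. */
static volatile int trigger_calls = 0;

/* Trigger symbol for the tracer to watch. noinline (GCC/Clang specific)
 * keeps it as an actual call to a named function. */
__attribute__((noinline)) void take_magic_trace(void)
{
    trigger_calls++;
}

/* Hypothetical handler: "wonky" here just means a blown latency budget. */
bool handle_request(long elapsed_us, long budget_us)
{
    bool something_really_wonky_happened = elapsed_us > budget_us;
    if (something_really_wonky_happened) {
        take_magic_trace();
    }
    return something_really_wonky_happened;
}
```

The symbol name is then handed to magic-trace when attaching to the process. As far as I recall this is done with a `-trigger` flag, but treat that as an assumption and check the tool's help output.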
We do try to support scripted languages with JITs that can emit info about what symbol is located where [1]. Notably, this more or less works for Node.js. It'll work somewhat for Python in that you'll see the Python interpreter frames (probably uninteresting), but you will see any ffi calls (e.g., numpy) with proper stacks.
It's worth noting that, aside from the overhead, function calls and returns alone are not quite enough to reconstruct the call stack: tail calls are just regular branch instructions.
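A small C illustration of the tail call problem, with hypothetical function names. With optimization enabled (typically `-O2` on GCC or Clang, though this is not guaranteed), the call in `scale_plus_one` is usually compiled to a plain `jmp scale`, so a tracer watching only call and return instructions never sees `scale` being entered from it.

```c
/* Hypothetical leaf function; noinline so it stays a separate symbol. */
__attribute__((noinline)) long scale(long x)
{
    return x * 3;
}

/* Tail call: nothing is left to do after scale() returns, so an
 * optimizing compiler may emit `jmp scale` instead of `call` + `ret`.
 * scale()'s own `ret` then goes straight back to our caller. */
__attribute__((noinline)) long scale_plus_one(long x)
{
    return scale(x + 1);
}

/* Not a tail call: the addition happens after scale() returns, so a
 * real `call` instruction is needed here. */
__attribute__((noinline)) long scale_then_add(long x)
{
    return scale(x) + 1;
}
```

Comparing the disassembly of the two wrappers (for example with `objdump -d`) should show the difference, assuming the compiler applies the optimization.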
We don't have plans to add ARM support largely because we have no in-house expertise with ARM. That said, ARM has CoreSight which sounds like it could support something like magic-trace in some form, and we'd definitely be open to community contributions for CoreSight support in magic-trace.