Maybe I expect too much, but when I started reading I was under the impression they had discovered a previously unknown bug.
Instead I found they hadn't applied microcode updates and were suffering from a bug that causes. Would it not be devops 101 to monitor what updates are available to your fleet (and selectively apply them)?
I would expect this of a small company that doesn't have the resources to look after its fleet, but not from Cloudflare. I also don't get their dig about 'errata'; it is standard practice to include post-release hardware faults (and their solutions) in errata.
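For what it's worth, the "monitor what's on your fleet" part can be small. A minimal sketch, assuming Linux x86 hosts, where /proc/cpuinfo reports a `microcode` field per logical CPU. The function names and `minimum_revision` are hypothetical, and comparing revisions numerically only makes sense between hosts with the same CPU model.

```python
import re

def microcode_revisions(cpuinfo_text):
    """Collect the distinct microcode revisions found in /proc/cpuinfo-style text."""
    revisions = set()
    for line in cpuinfo_text.splitlines():
        # Lines look like "microcode\t: 0x1b" on x86 Linux.
        match = re.match(r"microcode\s*:\s*(0x[0-9a-fA-F]+)", line)
        if match:
            revisions.add(int(match.group(1), 16))
    return revisions

def needs_update(cpuinfo_text, minimum_revision):
    """True if any logical CPU reports a revision below minimum_revision.

    minimum_revision is a hypothetical per-CPU-model value you would take
    from the vendor's release notes; it is not derived here.
    """
    return any(rev < minimum_revision for rev in microcode_revisions(cpuinfo_text))

# On a real host you would feed it the file contents:
#   with open("/proc/cpuinfo") as f:
#       print(needs_update(f.read(), minimum_revision=0x20))
```

This only flags hosts that are behind; whether and when to apply the update is a separate decision.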
You also don't (at any kind of scale) go around willy-nilly applying updates, even microcode updates. Ideally there's a graduated rollout of any update (in dev, then to testing, then to limited production, and then after a reasonable period, to full production), and that can take time and resources.
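To make the graduated rollout concrete, here is a rough sketch of the gate between stages. The stage order is the one described above; the failure threshold, soak time, and all names are placeholder assumptions, not anyone's actual process.

```python
from dataclasses import dataclass

# Stage order from the comment above.
STAGES = ["dev", "testing", "limited-production", "full-production"]

@dataclass
class StageResult:
    stage: str
    hosts_updated: int
    hosts_failed: int
    soak_hours: float  # how long the update has been running in this stage

def decide(result, max_failure_rate=0.0, min_soak_hours=24.0):
    """Return (action, stage): 'halt', 'hold', 'promote', or 'done'.

    The thresholds are illustrative defaults only.
    """
    if result.hosts_updated == 0:
        return ("hold", result.stage)  # nothing updated yet, nothing to judge
    failure_rate = result.hosts_failed / result.hosts_updated
    if failure_rate > max_failure_rate:
        return ("halt", result.stage)  # stop the rollout and investigate
    if result.soak_hours < min_soak_hours:
        return ("hold", result.stage)  # healthy so far, keep waiting
    index = STAGES.index(result.stage)
    if index + 1 == len(STAGES):
        return ("done", result.stage)
    return ("promote", STAGES[index + 1])
```

The soak period is where the "time and resources" go: with four stages and a one-day soak each, even a clean update takes days to reach the whole fleet.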
This process would be helped immensely if the sources of the updates (whether Intel, Microsoft, Google, or whoever) did more rigorous testing themselves, but in the modern age of "churn, baby, churn", they don't, and as a result, organizations that are uptime-sensitive have gotten reluctant to apply updates in a timely manner.
And to head off the inevitable calls of "bullshit", Intel just released a microcode update that caused some Linux distributions to fail to boot entirely and caused some Windows-based systems to reboot spontaneously.
This is not DevOps. This is more of Sysadmin territory. And Sysadmins have hard jobs. Most places don't require top-notch sysadmins because 80/20. But getting that last 20 as a sysadmin is incredibly difficult.
Yes and no. For somewhere like Cloudflare I would expect devops to manage this kind of sysadmin work. The roles aren't sharply defined, so it really is semantics.
except firmware updates tend to also remove all of your hard-won BIOS customizations and break other stuff and cause downtime and have crappy update methods in general and ...