Maybe I expect too much, but when I started reading I was under the impression they were going to have discovered a previously undiscovered bug.
Instead I found they hadn't applied microcode updates and were suffering from a known bug that causes crashes. Would it not be DevOps 101 to monitor what updates are available for your fleet (and selectively apply them)?
I would expect this of a small company that doesn't have the resources to look after its fleet, but not from Cloudflare. I also don't get their dig about 'errata': it is standard practice to document post-release hardware faults (and their solutions) in errata.
You also don't (at any kind of scale) go around willy-nilly applying updates, even microcode updates. Ideally there's a graduated rollout of any update (in dev, then to testing, then to limited production, and then after a reasonable period, to full production), and that can take time and resources.
This process would be helped immensely if the sources of the updates (whether Intel, Microsoft, Google, or whoever) did more rigorous testing themselves, but in the modern age of "churn, baby, churn", they don't, and as a result, organizations that are uptime-sensitive have gotten reluctant to apply updates in a timely manner.
And to head off the inevitable calls of "bullshit", Intel just released a microcode update that caused some Linux distributions to fail to boot entirely and caused some Windows-based systems to reboot spontaneously.
This is not DevOps. This is more of Sysadmin territory. And Sysadmins have hard jobs. Most places don't require top-notch sysadmins because 80/20. But getting that last 20 as a sysadmin is incredibly difficult.
Yes and no. For somewhere like CloudFlare I would expect devops to manage this kind of sysadmin work. The roles aren't sharply defined, so it's really semantics.
except firmware updates tend to also remove all of your hard-won bios customizations and break other stuff and cause downtime and have crappy update methods in general and ...
That is a lot of wasted time for an already known and fixed bug. At the scale of cloudflare I would have expected them to have a few people intimately familiar with their server hardware and in steady contact with their vendors, and to know and evaluate every firmware update for every component they use.
At my scale, we install whatever is publicly available at a time of our convenience, in the hopeful trust that the vendor knew what they were doing.
> SIGSEGV is not the only signal that indicates an error in a process and causes termination. We also saw process terminations due to SIGABRT and SIGILL
Also SIGBUS. For me it happens mostly when playing with mmap and shared memory. But casting random numbers to pointers and trying to access them can do it as well. Don't be too surprised if you see it once in a while.
> “BDF76 An Intel® Hyper-Threading Technology Enabled Processor May Exhibit Internal Parity Errors or Unpredictable System Behavior”
Lovely.
In this sense I've always liked it when various layers implement their own checksumming and sanity checking. Don't just rely on hardware to do it (ECC, RAID controllers, etc.). Wrote a database that saves a chunk to disk? Add a checksum with it. Have something that sends data over the wire? Send a checksum with it. With disks it's even more fun: periodically read and re-verify the data, as bit rot will slowly eat away at it. Same with backups: verify that your backups can be read and the data there is consistent.
It's not cheap and you'd pay a performance penalty for it, so it's definitely a tradeoff. Just make sure to consider it and don't forget about it.
> There was no obvious pattern to the servers which produced these mystery core dumps. We were getting about one a day on average across our fleet of servers...The probability that an individual server would get a mystery core dump seemed to be very low (about one per ten years of server uptime, assuming they were indeed equally likely for all our servers).
Does that tell us that Cloudflare has about 365 * 10 servers in their fleet? I can never quite work out probabilities but I figured they'd have more.
EDIT:
> all of the mystery core dumps came from servers containing The Intel Xeon E5-2650 v4. This model belongs to the generation of Intel processors that had the codename “Broadwell”, and it’s the only model of that generation that we use in our edge servers, so we simply call these servers Broadwells. The Broadwells made up about a third of our fleet at that time...
So at least 3 times the number of servers the above probabilities would suggest, and for edge nodes only. Not sure why I find working out this info fun but there you go.
Aside: per your last sentence, I have this affliction. If someone says a sum flippantly "the sqrt of 3M + 42" then I find a huge compulsion to render the answer.
> The most convenient way for us to apply the microcode update to our Broadwell servers at that time was via a BIOS update from the server vendor.
You can read this article from another point of view: there was a bug somewhere that was leaking customer information. After searching for a while, they discovered that the CPU had a bug that had already been solved/patched; all it took was a BIOS update, which hadn't been done before because of... well, whatever.
The CPU bug wasn't causing a security problem. We made a conscious decision to get crashes in production down to zero so that we would be alerted early if a subsequent security issue of a similar type occurred.
During that investigation we came across mystery crashes which were already fixed by a microcode update.
Thank you! We often use Creative Commons-licensed images in our blog posts. We always include credit, but we owe a big debt of gratitude to the people who take these photos and make them available.