Maybe I expect too much, but when I started reading I was under the impression they were going to have discovered a previously undiscovered bug.
Instead I found they hadn't applied microcode updates and were suffering from a known bug that causes crashes. Would it not be DevOps 101 to monitor what updates are available for your fleet (and selectively apply them)?
I would expect this of a small company that doesn't have the resources to look after its fleet, but not from Cloudflare. I also don't get their dig about 'errata': it is standard practice to document post-release hardware faults (and their solutions) in errata.
You also don't (at any kind of scale) go around willy-nilly applying updates, even microcode updates. Ideally there's a graduated rollout of any update (in dev, then to testing, then to limited production, and then after a reasonable period, to full production), and that can take time and resources.
This process would be helped immensely if the sources of the updates (whether Intel, Microsoft, Google, or whoever) did more rigorous testing themselves, but in the modern age of "churn, baby, churn", they don't, and as a result, organizations that are uptime-sensitive have gotten reluctant to apply updates in a timely manner.
And to head off the inevitable calls of "bullshit", Intel just released a microcode update that caused some Linux distributions to fail to boot entirely and caused some Windows-based systems to reboot spontaneously.
This is not DevOps. This is more of Sysadmin territory. And Sysadmins have hard jobs. Most places don't require top-notch sysadmins because 80/20. But getting that last 20 as a sysadmin is incredibly difficult.
Yes and no. For somewhere like CloudFlare I would expect devops to manage this kind of sysadmin work. The roles aren't sharply defined, so it's really semantics.
except firmware updates tend to also remove all of your hard-won bios customizations and break other stuff and cause downtime and have crappy update methods in general and ...
That is a lot of wasted time for an already known and fixed bug. At the scale of cloudflare I would have expected them to have a few people intimately familiar with their server hardware and in steady contact with their vendors, and to know and evaluate every firmware update for every component they use.
At my scale, we install whatever is publicly available at a time of our convenience, in the hopeful trust that the vendor knew what they were doing.
> SIGSEGV is not the only signal that indicates an error in a process and causes termination. We also saw process terminations due to SIGABRT and SIGILL
Also SIGBUS. For me it happens mostly when playing with mmap and shared memory. But casting random numbers to pointers and trying to access them can do it as well. Don't be too surprised if you see it once in a while.
> “BDF76 An Intel® Hyper-Threading Technology Enabled Processor May Exhibit Internal Parity Errors or Unpredictable System Behavior”
Lovely.
In this sense I've always liked it when various layers implement their own checksumming and sanity checking. Don't just rely on hardware to do it (ECC, RAID controllers, etc.). Wrote a database that saves a chunk to disk? Add a checksum with it. Have something that sends data over the wire? Send a checksum with it. With disks it's even more fun: periodically read and re-verify the data, as bit rot will slowly eat away at it. Same with backups: verify that your backups can be read and the data there is consistent.
It's not cheap and you'd pay a performance penalty for it, so it's definitely a tradeoff. Just make sure to consider it and don't forget about it.
> There was no obvious pattern to the servers which produced these mystery core dumps. We were getting about one a day on average across our fleet of servers...The probability that an individual server would get a mystery core dump seemed to be very low (about one per ten years of server uptime, assuming they were indeed equally likely for all our servers).
Does that tell us that Cloudflare has about 365 * 10 servers in their fleet? I can never quite work out probabilities but I figured they'd have more.
EDIT:
> all of the mystery core dumps came from servers containing The Intel Xeon E5-2650 v4. This model belongs to the generation of Intel processors that had the codename “Broadwell”, and it’s the only model of that generation that we use in our edge servers, so we simply call these servers Broadwells. The Broadwells made up about a third of our fleet at that time...
So at least 3 times the number of servers the above probabilities would suggest, and for edge nodes only. Not sure why I find working out this info fun but there you go.
Aside: per your last sentence, I have this affliction. If someone says a sum flippantly "the sqrt of 3M + 42" then I find a huge compulsion to render the answer.
> The most convenient way for us to apply the microcode update to our Broadwell servers at that time was via a BIOS update from the server vendor.
You can read this article from another point of view: there was a bug somewhere that was leaking customer information. After searching for a while, they discovered that the CPU had a bug that had already been solved/patched; all it took was a BIOS update, which hadn't been done before because of... well, whatever.
The CPU bug wasn't causing a security problem. We made a conscious decision to get crashes in production down to zero so that we would be alerted early if a subsequent security issue of a similar type occurred.
During that investigation we came across mystery crashes which were already fixed by a microcode update.
Thank you! We often use Creative Commons-licensed images in our blog posts. We always include credit, but we owe a big debt of gratitude to the people who take these photos and make them available.