Hacker News

Seems like very basic mistakes were made, not at the event but long before it. If you don't test restoring your backups, you don't have backups. How does it go unnoticed that S3 backups don't work for so long?


Helpful hint: Have an employee who regularly accidentally deletes folders. I have a couple, it's why I know my backups work. :D


Even better, have a Chaos Monkey do it ;)


Would you believe I have enough chaos already?


Yeah, the "You don't have backups unless you can restore them" strikes again.

Virtually the only way to lose data is to not have backups. We live in such fancy times that there's no reason to ever lose data that you care about.


Not "can restore them", it's "have restored them".

Best way to ensure that is to have backup restoration be a regularly scheduled event. For most apps I work on, that's either daily or (worst case) weekly, with prod being entirely rebuilt in a lower environment. Works great for creating a test lane too!
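A scheduled restore drill like the one described above can be sketched roughly like this, using SQLite as a stand-in for the real database (the `orders` table and the dumps are invented for illustration):

```python
# Hypothetical sketch of a scheduled restore drill: replay last night's
# dump into a fresh "lower environment" and refuse to call it a backup
# until a sanity check passes. SQLite stands in for the real database.
import sqlite3

def restore_and_verify(dump_sql: str) -> bool:
    staging = sqlite3.connect(":memory:")   # fresh lower environment
    try:
        staging.executescript(dump_sql)     # the actual restore step
        # Sanity check: a restore that "succeeds" but loads zero rows
        # is still a failed backup.
        (rows,) = staging.execute("SELECT count(*) FROM orders").fetchone()
        return rows > 0
    except sqlite3.Error:
        return False
    finally:
        staging.close()

# A nightly job would fetch the real dump; here, two fake ones:
good_dump = "CREATE TABLE orders (id INT); INSERT INTO orders VALUES (1);"
empty_dump = "CREATE TABLE orders (id INT);"
```

The point of checking row counts (not just the restore's exit status) is that a restore can "succeed" while loading nothing.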


> How does it go unnoticed that S3 backups don't work for so long?

My uneducated guess (this one hit a friend of mine): expired/revoked AWS credentials combined with a backup script that doesn't exit(1) on failure and just writes the exception trace to stderr.
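The failure mode guessed at above can be sketched in a few lines: the upload raises (say, on expired credentials), the script dutifully prints the traceback to stderr, but still exits 0, so cron and monitoring see "success" every night. Everything here is a made-up illustration, not a real backup script:

```python
# Hypothetical sketch: a backup script that swallows failures vs. one
# that surfaces them through its exit status.
import sys
import traceback

def upload_to_s3(path: str) -> None:
    # Stand-in for a real upload; assume it raises on expired credentials.
    raise PermissionError("ExpiredToken: the provided token has expired")

def buggy_backup(path: str) -> int:
    try:
        upload_to_s3(path)
        return 0
    except Exception:
        traceback.print_exc(file=sys.stderr)
        return 0   # BUG: traceback goes to stderr, exit status stays 0

def fixed_backup(path: str) -> int:
    try:
        upload_to_s3(path)
        return 0
    except Exception:
        traceback.print_exc(file=sys.stderr)
        return 1   # FIX: non-zero exit so cron/monitoring can alert
```

With the buggy version, anything watching only the exit code (cron mail on failure, a systemd `OnFailure=` hook, a CI job) never fires.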


I bet it writes to a log file. It just doesn't alert anyone on failure so the log just grows and grows daily with the same error.


Or it alerts people, but on the same channel every other piece of infrastructure alerts them, and they have a severe case of false positives.

I've seen that many more times than I've seen the "no alert" option.


New guy: "Hey I see an alert that XYZ failed to run."

Existing team: "Yah don't worry about that. It does that every day. We'll get to it sometime soon."



