Hacker News

Seems like very basic mistakes were made, not at the event but long before it. If you don't test restoring your backups, you don't have backups. How does it go unnoticed that S3 backups don't work for so long?


Helpful hint: Have an employee who regularly accidentally deletes folders. I have a couple, it's why I know my backups work. :D


Even better, have a Chaos Monkey do it ;)


Would you believe I have enough chaos already?


Yeah, the "You don't have backups unless you can restore them" strikes again.

Virtually the only way to lose data is to not have backups. We live in such fancy times that there's no reason to ever lose data that you care about.


Not "can restore them", it's "have restored them".

Best way to ensure that is to have backup restoration be a regularly scheduled event. For most apps I work on, that's either daily or (worst case) weekly, with prod being entirely rebuilt in a lower environment. Works great for creating a test lane too!
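A scheduled restore drill like the one described above can be sketched roughly like this, using SQLite as a stand-in for the real database (the `orders` table and the dumps are invented for illustration):

```python
# Hypothetical sketch of a scheduled restore drill: replay last night's
# dump into a fresh "lower environment" and refuse to call it a backup
# until a sanity check passes. SQLite stands in for the real database.
import sqlite3

def restore_and_verify(dump_sql: str) -> bool:
    staging = sqlite3.connect(":memory:")   # fresh lower environment
    try:
        staging.executescript(dump_sql)     # the actual restore step
        # Sanity check: a restore that "succeeds" but loads zero rows
        # is still a failed backup.
        (rows,) = staging.execute("SELECT count(*) FROM orders").fetchone()
        return rows > 0
    except sqlite3.Error:
        return False
    finally:
        staging.close()

# A nightly job would fetch the real dump; here, two fake ones:
good_dump = "CREATE TABLE orders (id INT); INSERT INTO orders VALUES (1);"
empty_dump = "CREATE TABLE orders (id INT);"
```

The point of checking row counts (not just the restore's exit status) is that a restore can "succeed" while loading nothing.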


> How does it go unnoticed that S3 backups don't work for so long?

My uneducated guess (this one hit a friend of mine): expired/revoked AWS credentials combined with a backup script that doesn't exit(1) on failure and just writes the exception trace to stderr.
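The failure mode guessed at above can be sketched in a few lines: the upload raises (say, on expired credentials), the script dutifully prints the traceback to stderr, but still exits 0, so cron and monitoring see "success" every night. Everything here is a made-up illustration, not a real backup script:

```python
# Hypothetical sketch: a backup script that swallows failures vs. one
# that surfaces them through its exit status.
import sys
import traceback

def upload_to_s3(path: str) -> None:
    # Stand-in for a real upload; assume it raises on expired credentials.
    raise PermissionError("ExpiredToken: the provided token has expired")

def buggy_backup(path: str) -> int:
    try:
        upload_to_s3(path)
        return 0
    except Exception:
        traceback.print_exc(file=sys.stderr)
        return 0   # BUG: traceback goes to stderr, exit status stays 0

def fixed_backup(path: str) -> int:
    try:
        upload_to_s3(path)
        return 0
    except Exception:
        traceback.print_exc(file=sys.stderr)
        return 1   # FIX: non-zero exit so cron/monitoring can alert
```

With the buggy version, anything watching only the exit code (cron mail on failure, a systemd `OnFailure=` hook, a CI job) never fires.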


I bet it writes to a log file. It just doesn't alert anyone on failure so the log just grows and grows daily with the same error.


Or it alerts people, but on the same channel every other piece of infrastructure alerts them, and they have a severe case of false positives.

I've seen that many more times than I've seen the "no alert" option.


New guy: "Hey I see an alert that XYZ failed to run."

Existing team: "Yah don't worry about that. It does that every day. We'll get to it sometime soon."



