
All comes down to uptime.

One box can’t be distributed across multiple racks in the data center to guard against downtime if a switch crashes. Never mind that—one box can’t be deployed across multiple data centers. If you deploy to multiple DCs you can fail over if one DC starts having issues.

Then there’s deploys. Do you canary your deploys? Deploy the next release to a subset of production nodes, watch for regressions and let it ramp up from there? Okay, I’ll give you that one, it could be done on one big box.
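The ramp-up described above can be sketched as weighted routing, where a growing fraction of requests lands on the new release. A minimal Python sketch; the version names and ramp schedule are made up for illustration:

```python
import random

def pick_backend(canary_weight: float, stable: str = "v1", canary: str = "v2") -> str:
    """Route a request to the canary release with probability canary_weight."""
    return canary if random.random() < canary_weight else stable

# Ramp the canary up in stages, watching for regressions at each step
# before increasing the weight.
ramp = [0.01, 0.05, 0.25, 0.50, 1.00]
```

At weight 0.0 everything goes to stable; at 1.0 everything goes to the canary, which is the final cutover step.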

In any case, payments aren’t CPU intensive but it’s a prime case of hurry-up-and-wait. Lots of network IO, so while you won’t saturate the CPU with millions of transactions on the same box, I could easily imagine saturating a NIC. Deploying to shared infrastructure? Better hope none of your neighbors need that bandwidth too.

One transaction likely involves checking account and payment method status, writing audit logs, checking in with anti-fraud systems and a number of other business requirements.
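Since each of those steps is mostly waiting on the network, they are typically issued concurrently rather than one after another. A toy asyncio sketch; every service name and latency here is hypothetical:

```python
import asyncio

# Hypothetical checks; each sleep stands in for a network call to
# another service (account service, fraud service, etc.).
async def check_account(tx):
    await asyncio.sleep(0.01)
    return True

async def check_payment_method(tx):
    await asyncio.sleep(0.01)
    return True

async def check_fraud(tx):
    await asyncio.sleep(0.01)
    return True

async def write_audit_log(tx):
    await asyncio.sleep(0.01)

async def process(tx):
    # The CPU is idle while all three checks are in flight: the
    # bottleneck is network IO, not compute.
    ok = all(await asyncio.gather(
        check_account(tx),
        check_payment_method(tx),
        check_fraud(tx),
    ))
    await write_audit_log(tx)
    return ok
```

The gather runs the checks in parallel, so wall-clock time per transaction is roughly the slowest dependency plus the audit write, not the sum of all of them.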

(I lead a payments team, not at Uber but another major tech company)



> One box can’t be distributed across multiple racks in the data center to guard against downtime if a switch crashes. Never mind that—one box can’t be deployed across multiple data centers. If you deploy to multiple DCs you can fail over if one DC starts having issues.

Wouldn't you just have multiple NICs on one box for redundancy there? With any standby boxes being sent the database write log for replication?

> In any case, payments aren’t CPU intensive but it’s a prime case of hurry-up-and-wait. Lots of network IO, so while you won’t saturate the CPU with millions of transactions on the same box, I could easily imagine saturating a NIC.

If you're vertically scaling, wouldn't you just have the main database server host the database files locally, using fast NVMe SSDs (or Optane) in the box itself, instead of going over the network?

Enterprise NVMe drives can perform 500,000-2,000,000 IOPS, with about 60us latency. And Optane is about 4x faster. Why would a database server need to saturate network bandwidth?
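Rough arithmetic supports that: even the low end of that IOPS range outruns a common 10 GbE NIC. The drive figure comes from the range quoted above; the NIC speed and IO size are assumptions for illustration:

```python
# Back-of-envelope: local NVMe throughput vs. a 10 GbE NIC.
iops = 500_000                     # low end of the quoted enterprise NVMe range
block = 4 * 1024                   # assumed 4 KiB per random IO
nvme_bytes_per_s = iops * block    # ~2.0 GB/s of local disk throughput

nic_bytes_per_s = 10e9 / 8         # 10 Gb/s NIC ~= 1.25 GB/s

# Even the slow end of the NVMe range beats the NIC, which is the point
# of keeping the data files local instead of fetching them over the wire.
print(nvme_bytes_per_s > nic_bytes_per_s)
```

With the fast end of the range (2M IOPS) or Optane, the gap only widens.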

Anyways, I'd love to see the actual SQL query for one of their transactions...


> Wouldn't you just have multiple NICs on one box for redundancy there?

What happens when the FBI raids the DC to confiscate the servers of another person, and also takes yours? https://blog.pinboard.in/2011/06/faq_about_the_recent_fbi_ra...


I'm largely referring to RPC calls, not DB queries. Many of those calls won't even be to services you control and may well be HTTP calls to other companies.


All comes down to uptime.

20 years ago we had 1000+ days of uptime on DEC kit; no one was even impressed by 500 days. Nowadays people build all sorts of elaborate contraptions to do what used to be entirely ordinary.


By uptime people usually mean availability to the end users, not literal machine uptime. That includes the availability of the entire datacenter infrastructure, its connectivity, and the internet paths to it, which makes high availability in a single datacenter pretty much impossible.
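The usual back-of-envelope for why a second datacenter helps, assuming independent failures (optimistic in practice, since correlated outages happen):

```python
# A single DC that is up 99.9% of the time is down ~8.8 hours a year.
single = 0.999

# With failover to a second, independent DC, users only see downtime
# when both are down at once.
both_down = (1 - single) ** 2      # 1e-6
multi_dc = 1 - both_down           # 0.999999 -- "six nines" on paper

print(round(multi_dc, 6))
```

The independence assumption is exactly what shared upstream providers, DNS, and correlated software bugs tend to break, so real-world numbers are worse, but the direction of the argument holds.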


Heh, I guess. In my scenario the users actually got that uptime too, ‘cos they were connected over LAT...


Doesn't do much good if you have to fail out of an entire data center.


You can with VMScluster. There are multi-site clusters with 15+ years uptime.



