
I'm one of the authors of the paper.

Thanks for pointing this out -- while I knew about various advanced Oracle features, I'd missed the Database Vault feature.

The multiverse database idea and Oracle's Database Vault are different, but somewhat complementary:

1) Database Vault is designed to protect sensitive application data against privileged users (DBAs, highly privileged role accounts). This is an important problem: administrator accounts are juicy targets, and restricting their access while still allowing maintenance activities is hard. Multiverse databases, as described in the paper, do not solve this problem, as the administrator still has access to the base universe. But ideas from Database Vault and its "realms" concept could be combined with the multiverse DB concept to achieve this!

2) Unlike multiverse databases, Database Vault does not protect the data against a buggy application, nor does it isolate different end-users within a single application from each other. In practice, this is where a lot of leaks do damage: a bug in the frontend either exposes information or can be exploited to expose it. Database Vault won't help here, as the leak does not even require a privileged user to be involved, and since the application runs inside a single realm.

You can see multiverse databases as the DB vault idea with application-level end-users each having their own realm. Making realms/universes work efficiently at scale, though, poses some serious systems research challenges. It also raises questions about how to write the policies defining what's visible to each end-user -- something that Database Vault doesn't have to deal with, because it merely requires an application-specific policy that indicates what information is sensitive and needs protecting.
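
As a rough, purely illustrative Python sketch of the per-end-user universe idea (the schema and policy here are made up, and the real system keeps universes incrementally up to date rather than recomputing them like this):

    # Base universe: the full, unredacted rows.
    messages = [
        {"id": 1, "sender": "alice", "recipient": "bob",   "body": "hi bob"},
        {"id": 2, "sender": "carol", "recipient": "alice", "body": "lunch?"},
        {"id": 3, "sender": "bob",   "recipient": "carol", "body": "secret"},
    ]

    def policy(user, row):
        # Hypothetical row policy: a user's universe contains only the
        # messages they sent or received; everything else is invisible.
        return row if user in (row["sender"], row["recipient"]) else None

    def universe(user, rows):
        # Materialise the per-end-user universe by applying the policy
        # to every base row.
        return [r for r in (policy(user, row) for row in rows) if r is not None]

    print(universe("alice", messages))  # alice sees rows 1 and 2 only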


Hi, author here.

The talk for which I put the slide deck together was given at a summer school and unfortunately not recorded, but if there is sufficient interest, I might tape a re-run and upload it. (Though unlikely to have time in the next month, so it might be a while.)

In the meantime, the original papers (listed in the bibliographies at http://malteschwarzkopf.de/research/assets/google-stack.pdf and http://malteschwarzkopf.de/research/assets/facebook-stack.pd...) have a lot more detail than my (very condensed) slides.


Please DO re-run the talk and share it with us. It would be quite valuable for a variety of audiences. Thanks in advance!


Thank you for sharing this, and for the consolidated papers list! It would be great if you could record and upload your next re-run!


Wow, that would be awesome!

I hope you do it!


Great! Thanks for sharing.


Interesting -- though a quick scan of the evaluation data sets suggests that none of them are as large as the Twitter one (1.1B edges) or the uk-2007-05 one (3B edges) that we (and other distributed graph processing systems) use. Presumably this is due to memory limitations on the GPU?


We will go into this a bit in part 2 of the blog post, but the bottom line is that it doesn't look like GraphX is bottlenecked on barrier-sync latency in this computation. In fact, the iterations in GraphX are quite long and hardly use the network at all, so we're not sure if there's much fine-grained synchronization going on.

That said, leaving the implementation details aside, latency is definitely a big deal, but 10G does help there, too: the latency for sending a fixed-size message can be a lot lower on an idle 10G network than on an idle 1G network. If we're talking about very small synchronization messages, then maybe there isn't much of a gain (network stack overhead dominates), but techniques like our destination-oriented edge processing help reduce the need for very fine-grained synchronization (for this computation at least). The only barrier-synchronization necessary in our fast 10G implementation is at the point at which no more updates are to be sent by any worker (this only happens once per iteration).
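
To illustrate the shape of that per-iteration loop, here is a toy, purely illustrative Python sketch (the real implementation is in Rust and the details differ): updates are pushed towards their destinations as they are produced, and the only barrier comes once all sending for the iteration is done.

    import queue
    import threading

    NUM_WORKERS = 4
    NUM_ITERATIONS = 3

    # One inbox per worker; updates are pushed straight to the destination's
    # inbox as they are produced, with no per-update synchronisation.
    inboxes = [queue.Queue() for _ in range(NUM_WORKERS)]
    barrier = threading.Barrier(NUM_WORKERS)

    def worker(worker_id):
        for it in range(NUM_ITERATIONS):
            # Stand-in for real edge processing: send a toy update to each peer.
            for dst in range(NUM_WORKERS):
                if dst != worker_id:
                    inboxes[dst].put((it, worker_id))
            # The single barrier per iteration: past this point, no worker
            # has further updates to send for this iteration.
            barrier.wait()
            while not inboxes[worker_id].empty():
                inboxes[worker_id].get()  # "apply" the update (a no-op here)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()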

You're quite right, however, that a lot of the work on network scheduling for big data computations mentioned in the NSDI paper operates at the coarse-grained level of some kind of 'flow' notion. This would indeed be very hard to disambiguate in our implementation (part 2 will show this in more detail); I'm not convinced that these algorithms would help timely dataflow at all.


Thanks, yes, I realised it wasn't really barrier-sync latency after I wrote the comment. :)

I remember that the NSDI paper actually made an Amdahl's-law-like argument (they give it a new name) and did something to the tune of "let's just eliminate time waiting on the network from the total runtime, which makes the network infinitely fast."
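
As a back-of-the-envelope illustration of that argument (numbers entirely made up, not from the paper):

    # If a job runs for 100s and spends 3s blocked on the network, removing
    # the blocked time bounds the benefit of an infinitely fast network at
    # roughly 1.03x.
    total_runtime_s = 100.0
    network_blocked_s = 3.0
    max_speedup = total_runtime_s / (total_runtime_s - network_blocked_s)
    print(f"Upper bound on speedup from a faster network: {max_speedup:.2f}x")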

Coming back to the post: if it's CPU overhead, shouldn't Java be pretty competitive with C/C++/Rust for common computations? There might be a lot of other things going on that affect how much one can squeeze out of the CPU (GC/object sizes, time spent in reflection/serialisation, maybe?).

It would be great to look at (a) the number of instructions that Java and the Rust implementation execute, and (b) the instructions-per-cycle issued (or its inverse, the CPI) in both cases. If it's memory sync that's slowing down Java, then Java's CPI must be (edit) _higher_ than Rust's.
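
A toy illustration of how I'd read those numbers (the counter values below are made up purely to show the comparison):

    # Hypothetical readings from hardware performance counters.
    java = {"instructions": 8.0e11, "cycles": 1.6e12}
    rust = {"instructions": 5.0e11, "cycles": 5.0e11}

    def cpi(counters):
        return counters["cycles"] / counters["instructions"]

    # Similar instruction counts but a much higher CPI would point at stalls
    # (memory synchronisation, cache misses); similar CPI but many more
    # instructions would point at extra work (GC, boxing, serialisation).
    print(f"Java CPI: {cpi(java):.2f}, Rust CPI: {cpi(rust):.2f}")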


Yep; the NSDI paper is correct in that there's hardly any time spent waiting on the network (as our traces in part 2 will show). However, that is not to say that the network being faster cannot help: if computation and communication are perfectly overlapped, then "blocked time analysis" (term from the NSDI paper) would not show any potential improvement, but faster communication can still improve the overall runtime (e.g., by reducing busy polling, or crucial updates arriving sooner).

The CPI number investigation is quite a good idea -- we in fact already have these numbers for the Rust-based timely dataflow, but I'll have a look to see how hard it'd be to get them for GraphX/Spark.


Yep, that's right. Looking forward to the CPI numbers!


Original author here.

This is interesting -- I tried to do my best to find out what's still in use and what is deprecated based on public information; happy to amend if anything is incorrect. (If you have publicly accessible written sources saying so, that'd be ideal!)

Note that owing to its origin (as part of my PhD thesis), this chart only mentions systems about which scientific, peer-reviewed papers have been published. That's why Scribe and Presto are missing; I couldn't find any papers about them. For Scribe, the GitHub repo I found explicitly says that it is no longer being maintained, although maybe it's still used internally.

Re Haystack: I'm surprised it's deprecated -- the f4 paper (2014) heavily implies that it is still used for "hot" blobs.

Re HipHop: ok, I wasn't sure if it was still current, since I had heard somewhere that it's been superseded. Couldn't find anything definite saying so, though. If you have a pointer, that'd be great.

BTW, one reason I posted these on Twitter was the hope to get exactly this kind of fact-checking going, so I'm pleased that feedback is coming :-)


HipHop's replacement was pretty widely reported: http://www.wired.com/2013/06/facebook-hhvm-saga/

This link has papers on pub/sub, HHVM, and so on: https://research.facebook.com/publications/

Re Haystack: possible I am misremembering, or the project to completely replace Haystack stalled since I left.

If you want to gather a more complete picture of infrastructure at these companies I suggest, well, not imposing the strange limitation of only reading peer-reviewed papers. Almost none of the stuff I worked on ended up in conference proceedings.


Thanks, I've added HHVM and marked HipHop as superseded.

I also added Wormhole, which I think is the pub/sub system you're referring to (published in NSDI 2015: https://www.usenix.org/system/files/conference/nsdi15/nsdi15...).

Updated version at original URL: http://malteschwarzkopf.de/research/assets/facebook-stack.pd...

Regarding the focus on academic papers: I agree that this does not reflect the entirety of what goes on inside companies (or indeed how open they are; FB and Google also release lots of open-source software). Certainly, only reading the papers would be a bad idea. However, a peer-reviewed paper (unlike, say, a blog post) is a permanent, citable reference and is part of the scientific literature. This sets a quality bar (enforced through peer review, which deemed the description to be plausible and accurate), and allows the amount of information to remain manageable. The number of other sources of information makes them impractical to write up concisely, and it is hard to say what ought to be included and what should not when going beyond published papers.

I don't think anyone should base their perception of how Google's or Facebook's stack works on these charts and bibliographies -- not least because they will quickly be out of date. However, I personally find them useful as a quick reference to comprehensive, high-quality descriptions of systems that are regularly mentioned in discussions :-)


Other ways to get information:

1. Ask the developers per email.

2. Fly out to SF, visit the campus, have lunch.

Will work for FB (I do it all the time), Google people won't tell you anything.


Re HipHop: I think HHVM + Hack (Facebook's internal "improved PHP") has superseded it, but while HHVM is open-sourced, Hack isn't public.


Hack is just a piece of HHVM, it's open: http://hacklang.org/


Ah thanks, I didn't realize it was out. I read a separate article that said it hadn't been released yet -- it was probably outdated.


Does Spanner talk to Bigtable? From reading the paper, I thought it was built directly on Colossus.


Ah, you're right!

I misread "This section [...] illustrate[s] how replication and distributed transactions have been layered onto our Bigtable-based implementation." in the paper as meaning that Spanner is partly layered upon BigTable, but what it really means is that the implementation is based upon (as in, inspired by) BigTable.

Spanner actually has its own tablet implementation as part of the spanserver (stored in Colossus) and does not use BigTable. I've amended the diagram to reflect this.


Another comparison point (curiously not referenced in the above paper) is this similar effort: https://www.usenix.org/conference/hotcloud13/workshop-progra... [PDF/slides/video]

From a quick skim, the Xilinx work synthesises the entire network stack (including TCP) in hardware, unlike the above study, which only supports UDP in the HW traffic manager.

The numbers in the Xilinx paper are more attractive than what this study found and the paper also includes power measurements (since Joule/request is one metric on which FPGAs and hardware do quite well compared to software).

That said, much of the latency gain likely comes from bypassing the OS kernel and its generalised network stack. There is plenty of existing work (unfortunately also not referenced in this paper) that does this and which achieves very low latency, albeit -- to be fair -- on x86 hardware. (Examples: Arrakis [1], IX [2] and MICA [3].)

[1] -- https://www.usenix.org/conference/osdi14/technical-sessions/...

[2] -- https://www.usenix.org/conference/osdi14/technical-sessions/...

[3] -- https://www.usenix.org/conference/nsdi14/technical-sessions/...


(Disclaimer: involved with the work linked.)

A short version of this tech report is due to appear in the January special issue on repeatable research of the Operating Systems review journal. A pre-print is available at http://www.cl.cam.ac.uk/~ms705/pub/papers/2015-osr-raft.pdf (10 pages).


Thanks for sharing!


Thanks for the comments! :) [Full disclosure: I'm the first author of the APSYS paper and one of the researchers working on DIOS.]

Cautionary note: the APSYS paper was written in April 2013 -- at the time, there was little more than the idea of "maybe someone should look into distributed OSes for data centres". Much time has passed since then, and there is now a DIOS prototype as well as far more design work; things have concretized significantly.

The target environment for DIOS is a shared, multi-user, multi-job cluster, such as Google's warehouse-scale data centres: an environment where many different tasks run, which selectively share data but which must also be isolated from each other. (Consider, for example, a front-end web server that obtains records from a back-end key-value store, and which produces logs that are later processed by batch jobs.)

"Data intensitivity" comes into this by virtue of the scale of the overall system (thousands of tasks/machines), but also via the applicable optimizations. For example, one key optimization is caching of data in memory (cf. Spark, Tachyon). True, this can be done in middleware (as these examples show), but arguably the OS already does it (viz. the buffer cache). Unifying the OS notion of caching (usually at inode/block level) and the distributed system's notion (at object level) seems like a reasonable proposition. Likewise, the OS kernel scheduler's long-term load-balancer and the cluster scheduler likely have information that they would benefit from sharing (e.g. about the interaction of different tasks sharing a machine).

DIOS is a research project that aims to find out if there is gain to be had from changing the OS -- we set out to answer the question, rather than knowing what we wanted to find! (The APSYS paper was testing the waters in terms of what other people think, and we got some useful feedback from it.)

So far (bearing in mind that the work is ongoing), it looks like the compelling advantages from changing the kernel abstractions are:

1. Complete control over data-flow: being able to actually enforce policies like "the task may only use its inputs to deterministically generate its outputs" (as commonly assumed -- but not enforced -- in systems like MapReduce, Dryad, CIEL etc.). When running an application with DIOS system calls only, we know for sure that there is no way I/O can have happened other than via these system calls and on the objects exposed to the task. This is something only the kernel can ensure.

2. Opportunities for end-to-end performance optimization: by having clear semantics of abstractions that are valid across machines, we can move away from the conventional wisdom that the OS should treat all communication equally. Concretely, the buffer copy implied in using BSD sockets for network communication and the generality of a kernel network stack are two examples of one-size-fits-all OS design that we can side-step: in one use-case, we fast-track UDP packets destined for a coordination service in user-space through a network stack bypass that delivers them at far lower latency. DIOS can make this optimization because it has semantic information about the low-latency needs of the application available. (For sure, there are other ways of implementing this, but they either involve kernel changes or give up the ability to track network data-flow [e.g. kernel bypass solutions]).

3. Convenient removal of scalability-inhibiting abstractions: it's well known that some POSIX APIs induce poor scalability (see e.g. the Commuter work from MIT in SOSP 2013). One key example is the notion that FD numbers are allocated in a monotonically increasing sequence (thus requiring synchronization between threads). By redesigning the abstractions for data-intensive applications, we can fix these sorts of problems in passing.
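
To illustrate the FD example with a toy Python sketch (purely illustrative -- the kernel's actual data structures look nothing like this): allocating descriptors from one shared, ordered namespace forces every thread through a common synchronisation point, whereas opaque per-thread handles need none.

    import threading

    class SharedFdTable:
        """Toy model of a process-wide descriptor table: every thread
        allocates from one shared, ordered namespace, so each allocation
        serialises on a process-wide lock."""
        def __init__(self):
            self._lock = threading.Lock()
            self._in_use = set()

        def allocate(self):
            with self._lock:               # all threads serialise here
                fd = 0
                while fd in self._in_use:  # deterministic "next number" rule
                    fd += 1
                self._in_use.add(fd)
                return fd

    class PerThreadHandles:
        """The alternative: opaque, per-thread handles with no global
        ordering guarantee, so no cross-thread synchronisation is needed."""
        _local = threading.local()

        def allocate(self):
            nxt = getattr(self._local, "next_id", 0)
            self._local.next_id = nxt + 1
            return (threading.get_ident(), nxt)  # handle = (thread, local id)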

Hope that helps!


(Full disclosure: I'm one of the researchers working on DIOS and first author on the linked paper above.)

Agreed -- the reliance on Linux is definitely something that we've found to be a mixed blessing.

We actually looked fairly seriously into other candidate OSes as starting points for building DIOS. We considered L4::Fiasco, xv6 and Barrelfish, and investigated Barrelfish in depth. (L4 and xv6, at least at the time, were available for 32-bit only, which clearly wasn't going to cut it for a warehouse-scale OS.) Ultimately, we went with Linux over Barrelfish due to the more comprehensive driver support, better documentation and the fact that it works on our test machines.

Barrelfish had great trouble booting on one of our machines with a lot of physical memory; after patching it, we managed to boot it, but did not have PCIe devices available (required for networking). To their credit, the Barrelfish team were very supportive, but progress was slow and working with Linux allowed us to get started right away.

However, using Linux has also meant buying into a bunch of annoying non-scalable implementation choices. For example, DIOS relies on fast memory mapping, but Linux serializes all access to the mm_struct for a process using a single semaphore.

That said, DIOS is (deliberately) written in such a way that it should be possible to port it to other host kernels. Specifically, the kernel code consists of three parts: 1. a small patch to route the new system calls to handlers; 2. a BSD-licensed core module that contains the DIOS logic, but which is implemented in standard ANSI C and does not rely on any Linux-specific kernel features; 3. a GPL-licensed "DIOS abstraction layer" (DAL), which offers access to the Linux kernel facilities for process and memory management, VFS calls, etc.

While our current prototype is for Linux only, we intend to revisit the Barrelfish port and will also look into porting DIOS to a BSD OS in the future. Barrelfish especially should be interesting -- it's a very good fit for DIOS's abstractions.


Thanks for the reply!

As someone involved in computing research (sometimes OS kernel research, even) I can appreciate the need to actually get something going.

Ideally, if I was starting off a project with the intent of focusing on something like distributed data management, I'd want to sit down and figure out what I actually need and write a kernel to do that, rather than pulling in Linux with its DECnet drivers and 300+ system calls and what-not. Maybe a Unix-like system with POSIX calls isn't the best way to approach the problem. The problem is that you end up spending a lot of time dicking around with the scheduler and figuring out why memory allocation sometimes screws up, and less time implementing the real deal.

I've had good experiences working with a 64-bit Plan 9 kernel, since I was already familiar with the code and it was quite minimal, but I still ended up fighting the compatibility/driver crap you mentioned. It's also pretty easy to get functionality, but getting functionality and performance is a pain.

