Greplin (YC W10) open sources 10-15x faster protocol buffers for Python (github.com/greplin)
197 points by rwalker on Jan 26, 2011 | 33 comments


For a long time (much longer than I expected it would take) I've been working on a protobuf implementation in C that does not use Google's C++ implementation at all. I've been through about three rewrites and I finally have the interface right. I'm hoping it will be usable with Python soon (weeks).

https://github.com/haberman/upb/wiki

(if anyone's looking at the code, I'm working on the src-refactoring branch at the moment)

The benefits of my approach are:

* you can avoid depending on a 1MB C++ library. upb is more like 30k compiled.

* you can avoid doing any code generation. instead you just load the .proto schema at runtime, so you don't have to get a C++ compiler involved.

* Google's protobuf library does have a dynamic/reflection option that avoids my previous point, but it is ~10x slower than generating C++ code. My library, last time I benchmarked it, was 70-90% of the speed of generated C++.


Here's a fast Python C extension for protobuf that's already usable:

https://github.com/acg/lwpb

I read through your upb code about 3-4 months ago, was initially impressed, but couldn't get the Python extension to work. Certain abstractions really lost me, like pushing and pulling between sources and sinks. Why not just let a top-level event loop run the show in terms of buffered reads and size calculation for writes? But maybe you've refactored since.


> but couldn't get the Python extension to work.

Sorry, I should be clearer about the current state of the code, which for the Python extension is: currently completely broken. Since I was focusing on the core interfaces, the more peripheral pieces (like the language extensions) are totally broken at the moment.

> Certain abstractions really lost me, like pushing and pulling between sources and sinks.

Hopefully more documentation will make this clear. Making sources and sinks a general abstraction makes the event-based interface independent of any specific serialization (like protobuf binary format, protobuf text format, a JSON serialization of the same schema, etc). The key thing about protobufs is that a .proto file defines a typed tree structure, and the core interfaces of upb let you iterate over that tree structure, regardless of how exactly that tree structure was serialized.

> Why not just let a top-level event loop run the show in terms of buffered reads and size calculation for writes?

In my most recent interface, the upb_src does indeed run an event loop, and calls callbacks for every input value, or on a submessage start, or on a submessage end. For a while I wanted to make upb_src a pull-based interface instead, to give the application more control over the main loop, but this created more problems than it solved.
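To give a sense of what such an event loop does, here is a toy pure-Python sketch that walks protobuf's binary wire format and fires a callback per field. This is illustrative only, not upb's actual API; the function names and callback signature are made up.

```python
# Illustrative sketch of an event-style walk over protobuf binary wire
# format, in the spirit of upb's callback interface. Not upb's real API.

def read_varint(buf, pos):
    """Decode a base-128 varint starting at pos; return (value, new_pos)."""
    result = shift = 0
    while True:
        b = buf[pos]
        result |= (b & 0x7F) << shift
        pos += 1
        if not b & 0x80:
            return result, pos
        shift += 7

def parse_events(buf, on_value):
    """Fire on_value(field_number, wire_type, value) for each field."""
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field, wire_type = tag >> 3, tag & 0x7
        if wire_type == 0:    # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:  # length-delimited (string/bytes/submessage)
            length, pos = read_varint(buf, pos)
            value = bytes(buf[pos:pos + length])
            pos += length
        else:
            raise NotImplementedError("wire type %d" % wire_type)
        on_value(field, wire_type, value)

# Field 1 = varint 150, field 2 = string "hi" (hand-encoded):
events = []
parse_events(b"\x08\x96\x01\x12\x02hi", lambda f, w, v: events.append((f, w, v)))
```

A real implementation also fires submessage start/end callbacks; this sketch only shows the flat case.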


Can you clarify what you mean by "70-90% of the speed of generated C++"?

Suppose that the generated C++ takes 1.0 seconds. Does your implementation take 0.7-0.9s or 1.7-1.9s or something else?


If the generated C++ can parse 1MB/s, I can parse at 700-900kB/s. 70-90% of the speed, not the time. So 1.1-1.4 seconds, in your example.
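In other words, speed ratios divide the time rather than multiply it:

```python
# 70-90% of the speed means dividing the reference time by the ratio.
cpp_time = 1.0  # generated C++ parse time from the example, seconds
for speed_ratio in (0.7, 0.9):
    upb_time = cpp_time / speed_ratio
    print("%.0f%% of the speed -> %.2f s" % (speed_ratio * 100, upb_time))
```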


Looks interesting. I might need to dig in.

I like the license, too.


I too have a speedy Protocol Buffer implementation in Python:

https://github.com/acg/lwpb

It clocks in at 11x faster than json, the same speedup reported by fast-pb. Only with lwpb:

* There's no codegen step -- which is a disgusting thing in a dynamic language, if you ask me.

* You're not forced into object-oriented programming: with lwpb you can decode and encode plain dicts.

Most of haberman's remarks apply to lwpb as well, i.e. it's fast, small, and doesn't pull in huge dependencies. The lwpb C code was originally written by Simon Kallweit and is similar in intent to upb.


fast and small footprint, as components should be


We (http://connex.io/) use Protocol Buffers quite heavily, and the Python implementation was the performance bottleneck in many places.

I was working on the same thing, CyPB, which is 17 times faster than Google's Python implementation: https://github.com/connexio/cypb

This one seems more complete though at the moment. I might just mark the ticket in our tracker as closed and switch to fastpb :-/


Nifty. I've passed it along to the appropriate folks.

Google uses SWIG-wrapped C++ proto bindings in Python pretty extensively, so I'm not sure how much this gets over that approach. I checked out the source; it's basically using Jinja templates to autogen Python/C API calls. Basically like SWIG, but not using SWIG.


When I was at Google I worked with very large structured protocol buffers in Python at one point. A single piece of data could be hundreds of MB in total, consisting of millions of smaller protocol buffers. I was doing a pass over the whole structure so needed to access each smaller PB from Python.

One day I decided my program was too slow so I profiled it and saw that the hot spots were in the Python protocol buffer implementation. "Easy", I thought, "I'll use SWIGed C++ PBs instead." Made some changes and ran the program again. Almost the exact same run time as before! I profiled again and found that this time the hot spots were in the SWIG layer. I was making so many calls through SWIG to C++ (because I was walking millions of objects) that using SWIGed PBs vs. native Python PBs made no difference to my run time. Maybe I could have done some more custom SWIG work to lower the call overhead, but I remember being convinced at the time that SWIG wasn't going to do the trick.

So I ended up writing a 30-line Python extension that processed the protocol buffers in C++ and put the data into Python data structures. Run time was reduced by a factor of 10, hooray!


I hope to see these changes incorporated within Google's official implementation.

As of right now, deserialization of json and xml are way faster in Python: http://stackoverflow.com/questions/499593/whats-the-best-ser...


It doesn't appear to actually be open source:

> # Copyright 2010 Greplin, Inc. All Rights Reserved.

Where's the license?

I think the term you want is "publishes", and not "open sources".


Good catch - updating now. It'll be under Apache 2.0. (edit: done)


Excellent! I mean, Apache's a bit of an overly complex license implementation for what it does, but what it does is pretty good.

Thanks.


As an aside, I don't really like the idea that "open source" should also have to be synonymous with "free software". Can the intended users access (and possibly modify at least locally) the source? Then it's open source. Why do we need "open source" to be identical to "free software"? Isn't that what "free software" means?


Any downvoters want to explain here? I don't understand what's wrong with the question. Please explain more fully why "open source" should be synonymous with "free software", or, if you downvoted for another reason, please tell me where I went wrong. Thanks.


I didn't downvote (I haven't even been to the page since before you posted that comment until just now), but maybe I can help you understand why I disagree with what you said:

http://www.techrepublic.com/blog/opensource/when-open-source...

tl;dr: The term "open source" was coined by the people who founded the OSI, and they provided a definition for it. The term many people in the open source community use for software whose source is available (but not open source) is "source-available software".


This is very welcome, but I hope Google fixes this problem in the official protobuf distribution.

It looks like protobuf 2.4.0 has experimental support for backing Python protocol buffers with C++ via the PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION environment variable:

http://protobuf.googlecode.com/svn/trunk/CHANGES.txt
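Assuming the changelog entry is accurate, opting in would look something like this (the `cpp` value is an assumption based on how the variable is used in later protobuf releases):

```shell
# Ask the protobuf Python runtime to back messages with C++
# (experimental in protobuf 2.4.0, per CHANGES.txt).
export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=cpp
python my_program.py
```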


Is this really a better approach than using Cython to wrap a c++ or c implementation?


You should add cPickle to the benchmark as well -- I bet fast-pb still comes out ahead, and that may be an eye opener for many Python devs.
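For anyone who wants to try it, adding (c)Pickle to a micro-benchmark like this is a few lines; the message shape below is made up, not the fixture from benchmark.py:

```python
# Sketch of a json-vs-pickle decode micro-benchmark (Python 3, where
# pickle uses the cPickle C accelerator automatically).
import json
import pickle
import timeit

msg = {"id": 12345, "name": "benchmark", "tags": ["a", "b", "c"] * 10}

json_blob = json.dumps(msg)
pickle_blob = pickle.dumps(msg, protocol=pickle.HIGHEST_PROTOCOL)

json_t = timeit.timeit(lambda: json.loads(json_blob), number=10000)
pickle_t = timeit.timeit(lambda: pickle.loads(pickle_blob), number=10000)
print("json: %.3fs  pickle: %.3fs" % (json_t, pickle_t))
```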


MessagePack is up to 4x faster* than protobuf, and easier to work with btw IMHO.

http://msgpack.org/

I used it as the native format for DripDrop (https://github.com/andrewvc/dripdrop)

* In some tests


> * In Some tests

In that test protobuf is forced to copy the 512-byte string 200,000 times, while it appears that MessagePack is referencing it.

Granted it's a bummer that protobuf can't do this easily (my protobuf library upb can -- see above post), but I think it's dishonest not to mention that a large portion of the difference (if not all of it) is just memcpy() that protobuf is doing but MessagePack is not.
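The copy-vs-reference difference is easy to see in miniature with plain Python buffers (this just illustrates the general principle; neither library's actual internals are shown):

```python
# Slicing into a new bytes object copies the payload, like a protobuf
# string field; a memoryview references the buffer without copying,
# analogous to what MessagePack appears to do in that benchmark.
buf = bytearray(b"x" * 512)

copy = bytes(buf)        # copies all 512 bytes
view = memoryview(buf)   # zero-copy reference to the same buffer

buf[0] = ord(b"y")       # mutate the underlying buffer
assert copy[0] == ord(b"x")  # the copy is unaffected
assert view[0] == ord(b"y")  # the view sees the mutation
```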

It reminds me of when I worked at Amazon and we had a developer conference with several speakers. One speaker was plugging Erlang and showed a graph comparing C++ processes with Erlang processes, and the graph showed C++ being much slower or bigger. Scott Meyers was in the audience and raised his hand to ask "what are the Erlang processes not doing, to explain the difference?" The guy couldn't answer that question directly.

After a bit of digging, you realize that an Erlang "process" is a lightweight, interpreter-level abstraction that is implemented inside a regular OS process. So naturally it doesn't have any of the overhead that is associated with an OS process, and you don't have to make a system call to perform IPC.

So when you're posting benchmark comparisons, I think it's only right to mention any inherent differences in how much work you're doing.


Do you mean four times faster or do you mean four times as fast? These terms are not synonymous. Four times faster means the same thing as five times as fast.

i.e.: If it is four times as fast, you multiply the speed by four, and that's the new speed. If it is four times faster, you multiply the speed by four and add it to the original speed, because that's how much faster it is than the original.
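In numbers, under that strict reading:

```python
base = 100.0                          # original speed, ops/sec
four_times_as_fast = 4 * base         # 400.0 ops/sec
four_times_faster = base + 4 * base   # 500.0 ops/sec, i.e. 5x as fast
```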

I fucking hate that television commercials have conflated the two for the public at large. If you have followed in their footsteps, I hope this public service announcement has helped you sort that out for the future.


Has anyone managed to run the fast-pb tests in benchmark.py? I'm not sure where this switch is coming from:

  protoc --fastpython_out


Have you installed both protocol buffers and the fast-python-pb module? Feel free to email me: robbyw@(the-company-mentioned-in-the-title).com


Thanks Robby, got the benchmark working, was trying to do a homedir install of fast-python-pb earlier.

I added a couple more tests to the benchmark, here are the results:

  JSON
  3.57209396362

  Protocol Buffer (fast-python-pb)
  0.325706005096

  Protocol Buffer (native)
  4.83730196953

  Protocol Buffer (lwpb)
  0.32919216156

  cPickle
  0.837985038757

As you can see, lwpb and fast-python-pb are neck and neck. And I should point out that lwpb isn't using C++ codegen at all, just the compiled schema from the .proto file. Of course, if completeness of implementation were the critical thing, you'd probably want to stay closer to Google's official implementation. There's a lot of the Google implementation that I never use though, like the RPC stuff.

Also notable that both lwpb and fast-python-pb outperform cPickle by almost 3x. It would be interesting to know why a portable, cross-language serialization format beats out the language-specific one.

Here's a fork with the patched benchmark code: https://github.com/acg/fast-python-pb


> It would be interesting to know why a portable, cross-language serialization format beats out the language-specific one.

Because protobuf parses messages with a fixed schema in a very structured format. Pickle, OTOH, is an interpreted bytecode microlanguage used to describe arbitrary Python objects (for instance, pickle can call Python functions: http://nadiana.com/python-pickle-insecure).

Also, pickle supports references (so if an object is referenced twice in the same serialization stream, it is serialized only once), and this has a cost at serialization time (it needs to keep the set of seen references).
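The reference tracking is easy to demonstrate:

```python
# pickle memoizes objects it has already seen: a list referenced twice
# is serialized once, and object identity survives the round trip.
import pickle

shared = [1, 2, 3]
obj = {"a": shared, "b": shared}

restored = pickle.loads(pickle.dumps(obj))
assert restored["a"] is restored["b"]  # same object, not two copies
```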


Perl's Storable can also serialize code references, which get eval'ed during deserialization, and can also serialize repeated references only once, though you have to be more explicit about that. And Storable is still 2x-3x faster than the already quite fast JSON::XS.

It would seem the bytecode interpreter architecture in Pickle is the limiting factor. If anybody has some good profiling data on Pickle though, I'd love to see it.


(c)pickle uses a string protocol by default, which is the most portable (even across Python versions and platforms). You need to specify another protocol version for best performance. See http://docs.python.org/library/pickle.html#data-stream-forma...
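The difference between the default ASCII protocol and the binary ones is easy to check:

```python
# Protocol 0 is the old ASCII default; higher protocols are binary,
# which is both smaller on the wire and faster to decode.
import pickle

data = list(range(1000))
p0 = pickle.dumps(data, protocol=0)
p_best = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)
print(len(p0), len(p_best))  # the binary pickle is much smaller
```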

The times should be similar to lwpb then.

Edit2: Oh, and for JSON you should use http://pypi.python.org/pypi/simplejson


Yup, simplejson is much faster than the standard json.

  JSON
  3.56521892548
  SimpleJSON
  0.727998971939
  Protocol Buffer (fast)
  0.38397192955
  Protocol Buffer (standard)
  4.86640501022
  Protocol Buffer (lwpb)
  0.323328971863
  cPickle
  0.811990976334


I think you'll find py-yajl to be faster than any of the other Python JSON modules: https://github.com/rtyler/py-yajl


> It would be interesting to know why a portable, cross-language serialization format beats out the language-specific one.

Odds are that it is because the language-specific one supports features that the portable one does not. A feature I'd be particularly suspicious of is, "Did we already encounter this data structure and serialize it?" Supporting that feature means tracking a LOT of information during the serialization process, whether or not you get to use it.



