Greplin (YC W10) open sources 10-15x faster protocol buffers for Python (github.com/greplin)
197 points by rwalker on Jan 26, 2011 | 33 comments


For a long time (much longer than I expected it would take) I've been working on a protobuf implementation in C that does not use Google's C++ implementation at all. I've been through about three rewrites and I finally have the interface right. I'm hoping it will be usable with Python soon (weeks).

https://github.com/haberman/upb/wiki

(if anyone's looking at the code, I'm working on the src-refactoring branch at the moment)

The benefits of my approach are:

* you can avoid depending on a 1MB C++ library. upb is more like 30k compiled.

* you can avoid doing any code generation. instead you just load the .proto schema at runtime, so you don't have to get a C++ compiler involved.

* Google's protobuf library does have a dynamic/reflection option that avoids my previous point, but it is ~10x slower than generating C++ code. My library, last time I benchmarked it, was 70-90% of the speed of generated C++.


Here's a fast Python C extension for protobuf that's already usable:

https://github.com/acg/lwpb

I read through your upb code about 3-4 months ago, was initially impressed, but couldn't get the Python extension to work. Certain abstractions really lost me, like pushing and pulling between sources and sinks. Why not just let a top-level event loop run the show in terms of buffered reads and size calculation for writes? But maybe you've refactored since.


> but couldn't get the Python extension to work.

Sorry, I should be clearer about the current state of the code, which for the Python extension is: currently completely broken. Since I was focusing on the core interfaces, the more peripheral pieces (like the language extensions) are totally broken at the moment.

> Certain abstractions really lost me, like pushing and pulling between sources and sinks.

Hopefully more documentation will make this clear. Making sources and sinks a general abstraction makes the event-based interface independent of any specific serialization (like protobuf binary format, protobuf text format, a JSON serialization of the same schema, etc). The key thing about protobufs is that a .proto file defines a typed tree structure, and the core interfaces of upb let you iterate over that tree structure, regardless of how exactly that tree structure was serialized.

> Why not just let a top-level event loop run the show in terms of buffered reads and size calculation for writes?

In my most recent interface, the upb_src does indeed run an event loop, and calls callbacks for every input value, or on a submessage start, or on a submessage end. For a while I wanted to make upb_src a pull-based interface instead, to give the application more control over the main loop, but this created more problems than it solved.
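To give a sense of what such an event loop does, here is a toy pure-Python sketch that walks protobuf's binary wire format and fires a callback per field. This is illustrative only, not upb's actual API; the function names and callback signature are made up.

```python
# Illustrative sketch of an event-style walk over protobuf binary wire
# format, in the spirit of upb's callback interface. Not upb's real API.

def read_varint(buf, pos):
    """Decode a base-128 varint starting at pos; return (value, new_pos)."""
    result = shift = 0
    while True:
        b = buf[pos]
        result |= (b & 0x7F) << shift
        pos += 1
        if not b & 0x80:
            return result, pos
        shift += 7

def parse_events(buf, on_value):
    """Fire on_value(field_number, wire_type, value) for each field."""
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field, wire_type = tag >> 3, tag & 0x7
        if wire_type == 0:    # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:  # length-delimited (string/bytes/submessage)
            length, pos = read_varint(buf, pos)
            value = bytes(buf[pos:pos + length])
            pos += length
        else:
            raise NotImplementedError("wire type %d" % wire_type)
        on_value(field, wire_type, value)

# Field 1 = varint 150, field 2 = string "hi" (hand-encoded):
events = []
parse_events(b"\x08\x96\x01\x12\x02hi", lambda f, w, v: events.append((f, w, v)))
```

A real implementation also fires submessage start/end callbacks; this sketch only shows the flat case.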


Can you clarify what you mean by "70-90% of the speed of generated C++"?

Suppose that the generated C++ takes 1.0 seconds. Does your implementation take 0.7-0.9s or 1.7-1.9s or something else?


If the generated C++ can parse 1MB/s, I can parse at 700-900kB/s. 70-90% of the speed, not the time. So 1.1-1.4 seconds, in your example.
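In other words, speed ratios divide the time rather than multiply it:

```python
# 70-90% of the speed means dividing the reference time by the ratio.
cpp_time = 1.0  # generated C++ parse time from the example, seconds
for speed_ratio in (0.7, 0.9):
    upb_time = cpp_time / speed_ratio
    print("%.0f%% of the speed -> %.2f s" % (speed_ratio * 100, upb_time))
```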


Looks interesting. I might need to dig in.

I like the license, too.


I too have a speedy Protocol Buffer implementation in Python:

https://github.com/acg/lwpb

It clocks in at 11x faster than json, the same speedup reported by fast-pb. Only with lwpb:

* There's no codegen step -- which is a disgusting thing in a dynamic language, if you ask me.

* You're not forced into object-oriented programming: with lwpb you can decode and encode plain dicts.

Most of haberman's remarks apply to lwpb as well, i.e. it's fast, small, and doesn't pull in huge dependencies. The lwpb C code was originally written by Simon Kallweit and is similar in intent to upb.


fast and small footprint, as components should be


We (http://connex.io/) use Protocol Buffers quite heavily, and the Python implementation was the performance bottleneck in many places.

I was working on the same thing, CyPB, which is 17 times faster than Google's Python implementation: https://github.com/connexio/cypb

This one seems more complete though at the moment. I might just mark the ticket in our tracker as closed and switch to fastpb :-/


Nifty. I've passed it along to the appropriate folks.

Google uses SWIG-wrapped C++ proto bindings in Python pretty extensively, so I'm not sure how much this gets over that approach. I checked out the source; it's basically using Jinja templates to autogen Python/C API calls. Basically like SWIG, but not using SWIG.


When I was at Google I worked with very large structured protocol buffers in Python at one point. A single piece of data could be hundreds of MB in total, consisting of millions of smaller protocol buffers. I was doing a pass over the whole structure so needed to access each smaller PB from Python.

One day I decided my program was too slow so I profiled it and saw that the hot spots were in the Python protocol buffer implementation. "Easy", I thought, "I'll use SWIGed C++ PBs instead." Made some changes and ran the program again. Almost the exact same run time as before! I profiled again and found that this time the hot spots were in the SWIG layer. I was making so many calls through SWIG to C++ (because I was walking millions of objects) that using SWIGed PBs vs. native Python PBs made no difference to my run time. Maybe I could have done some more custom SWIG work to lower the call overhead, but I remember being convinced at the time that SWIG wasn't going to do the trick.

So I ended up writing a 30-line Python extension that processed the protocol buffers in C++ and put the data into Python data structures. Run time was reduced by a factor of 10, hooray!


I hope to see these changes incorporated within Google's official implementation.

As of right now, deserialization of json and xml are way faster in Python: http://stackoverflow.com/questions/499593/whats-the-best-ser...


It doesn't appear to actually be open source:

> # Copyright 2010 Greplin, Inc. All Rights Reserved.

Where's the license?

I think the term you want is "publishes", and not "open sources".


Good catch - updating now. It'll be under Apache 2.0. (edit: done)


Excellent! I mean, Apache's a bit of an overly complex license implementation for what it does, but what it does is pretty good.

Thanks.


As an aside, I don't really like the idea that "open source" should also have to be synonymous with "free software". Can the intended users access (and possibly modify at least locally) the source? Then it's open source. Why do we need "open source" to be identical to "free software"? Isn't that what "free software" means?


Any downvoters want to explain here? I don't understand what's wrong with the question. Please explain more fully why "open source" should be synonymous with "free software", or, if you downvoted for another reason, please tell me where I went wrong. Thanks.


I didn't downvote (I haven't even been to the page since before you posted that comment until just now), but maybe I can help you understand why I disagree with what you said:

http://www.techrepublic.com/blog/opensource/when-open-source...

tl;dr: The term "open source" was coined by the people who founded the OSI, and they provided a definition for it. The term many people in the open source community use for software whose source is available (but not open source) is "source-available software".


This is very welcome, but I hope Google fixes this problem in the official protobuf distribution.

It looks like protobuf 2.4.0 has experimental support for backing Python protocol buffers with C++ via the PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION environment variable:

http://protobuf.googlecode.com/svn/trunk/CHANGES.txt
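Assuming the changelog entry is accurate, opting in would look something like this (the `cpp` value is an assumption based on how the variable is used in later protobuf releases):

```shell
# Ask the protobuf Python runtime to back messages with C++
# (experimental in protobuf 2.4.0, per CHANGES.txt).
export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=cpp
python my_program.py
```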


Is this really a better approach than using Cython to wrap a c++ or c implementation?


You should add cPickle to the benchmark as well -- I bet fast-pb still comes out ahead, and that may be an eye opener for many Python devs.
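For anyone who wants to try it, adding (c)Pickle to a micro-benchmark like this is a few lines; the message shape below is made up, not the fixture from benchmark.py:

```python
# Sketch of a json-vs-pickle decode micro-benchmark (Python 3, where
# pickle uses the cPickle C accelerator automatically).
import json
import pickle
import timeit

msg = {"id": 12345, "name": "benchmark", "tags": ["a", "b", "c"] * 10}

json_blob = json.dumps(msg)
pickle_blob = pickle.dumps(msg, protocol=pickle.HIGHEST_PROTOCOL)

json_t = timeit.timeit(lambda: json.loads(json_blob), number=10000)
pickle_t = timeit.timeit(lambda: pickle.loads(pickle_blob), number=10000)
print("json: %.3fs  pickle: %.3fs" % (json_t, pickle_t))
```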


MessagePack is up to 4x faster* than protobuf, and easier to work with btw IMHO.

http://msgpack.org/

I used it as the native format for DripDrop (https://github.com/andrewvc/dripdrop)

* In some tests


> * In Some tests

In that test protobuf is forced to copy the 512-byte string 200,000 times, while it appears that MessagePack is referencing it.

Granted it's a bummer that protobuf can't do this easily (my protobuf library upb can -- see above post), but I think it's dishonest not to mention that a large portion of the difference (if not all of it) is just memcpy() that protobuf is doing but MessagePack is not.
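The copy-vs-reference difference is easy to see in miniature with plain Python buffers (this just illustrates the general principle; neither library's actual internals are shown):

```python
# Slicing into a new bytes object copies the payload, like a protobuf
# string field; a memoryview references the buffer without copying,
# analogous to what MessagePack appears to do in that benchmark.
buf = bytearray(b"x" * 512)

copy = bytes(buf)        # copies all 512 bytes
view = memoryview(buf)   # zero-copy reference to the same buffer

buf[0] = ord(b"y")       # mutate the underlying buffer
assert copy[0] == ord(b"x")  # the copy is unaffected
assert view[0] == ord(b"y")  # the view sees the mutation
```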

It reminds me of when I worked at Amazon and we had a developer conference with several speakers. One speaker was plugging Erlang and showed a graph comparing C++ processes with Erlang processes, and the graph showed C++ being much slower or bigger. Scott Meyers was in the audience and raised his hand to ask "what are the Erlang processes not doing, to explain the difference?" The guy couldn't answer that question directly.

After a bit of digging, you realize that an Erlang "process" is a lightweight, interpreter-level abstraction that is implemented inside a regular OS process. So naturally it doesn't have any of the overhead that is associated with an OS process, and you don't have to make a system call to perform IPC.

So when you're posting benchmark comparisons, I think it's only right to mention any inherent differences in how much work you're doing.


Do you mean four times faster or do you mean four times as fast? These terms are not synonymous. Four times faster means the same thing as five times as fast.

i.e.: If it is four times as fast, you multiply the speed by four, and that's the new speed. If it is four times faster, you multiply the speed by four and add it to the original speed, because that's how much faster it is than the original.
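In numbers, under that strict reading:

```python
base = 100.0                          # original speed, ops/sec
four_times_as_fast = 4 * base         # 400.0 ops/sec
four_times_faster = base + 4 * base   # 500.0 ops/sec, i.e. 5x as fast
```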

I fucking hate that television commercials have conflated the two for the public at large. If you have followed in their footsteps, I hope this public service announcement has helped you sort that out for the future.


Has anyone managed to run the fast-pb tests in benchmark.py? I'm not sure where this switch is coming from:

  protoc --fastpython_out


Have you installed both protocol buffers and the fast-python-pb module? Feel free to email me: robbyw@(the-company-mentioned-in-the-title).com


Thanks Robby, got the benchmark working, was trying to do a homedir install of fast-python-pb earlier.

I added a couple more tests to the benchmark, here are the results:

  JSON
  3.57209396362

  Protocol Buffer (fast-python-pb)
  0.325706005096

  Protocol Buffer (native)
  4.83730196953

  Protocol Buffer (lwpb)
  0.32919216156

  cPickle
  0.837985038757

As you can see, lwpb and fast-python-pb are neck and neck. And I should point out that lwpb isn't using C++ codegen at all, just the compiled schema from the .proto file. Of course, if completeness of implementation were the critical thing, you'd probably want to stay closer to Google's official implementation. There's a lot of the Google implementation that I never use though, like the RPC stuff.

Also notable that both lwpb and fast-python-pb outperform cPickle by almost 3x. It would be interesting to know why a portable, cross-language serialization format beats out the language-specific one.

Here's a fork with the patched benchmark code: https://github.com/acg/fast-python-pb


> It would be interesting to know why a portable, cross-language serialization format beats out the language-specific one.

Because protobuf parses messages with a fixed schema in a very structured format. Pickle, OTOH, is an interpreted bytecode microlanguage used to describe arbitrary Python objects (for instance, pickle can call Python functions: http://nadiana.com/python-pickle-insecure).

Also, pickle supports references (so if an object is referenced twice in the same serialization stream, it is serialized only once), and this has a cost at serialization time (it needs to keep the set of seen references).
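The reference tracking is easy to demonstrate:

```python
# pickle memoizes objects it has already seen: a list referenced twice
# is serialized once, and object identity survives the round trip.
import pickle

shared = [1, 2, 3]
obj = {"a": shared, "b": shared}

restored = pickle.loads(pickle.dumps(obj))
assert restored["a"] is restored["b"]  # same object, not two copies
```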


Perl's Storable can also serialize code references, which get eval'ed during deserialization, and can also serialize repeated references only once, though you have to be more explicit about that. And Storable is still 2x-3x faster than the already quite fast JSON::XS.

It would seem the bytecode interpreter architecture in Pickle is the limiting factor. If anybody has some good profiling data on Pickle though, I'd love to see it.


(c)pickle uses a string protocol by default, which is the most portable (even across Python versions and platforms). You need to specify another protocol version for best performance. See http://docs.python.org/library/pickle.html#data-stream-forma...
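The difference between the default ASCII protocol and the binary ones is easy to check:

```python
# Protocol 0 is the old ASCII default; higher protocols are binary,
# which is both smaller on the wire and faster to decode.
import pickle

data = list(range(1000))
p0 = pickle.dumps(data, protocol=0)
p_best = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)
print(len(p0), len(p_best))  # the binary pickle is much smaller
```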

The times should be similar to lwpb then.

Edit2: Oh, and for JSON you should use http://pypi.python.org/pypi/simplejson


Yup, simplejson is much faster than the standard json.

  JSON
  3.56521892548
  SimpleJSON
  0.727998971939
  Protocol Buffer (fast)
  0.38397192955
  Protocol Buffer (standard)
  4.86640501022
  Protocol Buffer (lwpb)
  0.323328971863
  cPickle
  0.811990976334


I think you'll find py-yajl to be faster than any of the other Python JSON modules: https://github.com/rtyler/py-yajl


> It would be interesting to know why a portable, cross-language serialization format beats out the language-specific one.

Odds are that it is because the language-specific one supports features that the portable one does not. A feature I'd be particularly suspicious of is, "Did we already encounter this data structure and serialize it?" Supporting that feature means tracking a LOT of information during the serialization process, whether or not you get to use it.



