random

Reputation: 4028

Why is protobuf bad for large data structures?

I'm new to protobuf. I need to serialize a complex graph-like structure and share it between C++ and Python clients. I'm trying to apply protobuf because:

But Protobuf user guide says:

Protocol Buffers are not designed to handle large messages. As a general rule of thumb, if you are dealing in messages larger than a megabyte each, it may be time to consider an alternate strategy.

https://developers.google.com/protocol-buffers/docs/techniques#large-data

I have graph-like structures that are sometimes up to 1 GB in size, well above 1 MB.

Why is protobuf bad for serializing large datasets? What should I use instead?

Upvotes: 39

Views: 18046

Answers (2)

Ken Bloom

Reputation: 58770

It should be fine to use protocol buffers that are much larger than 1MB. We do it all the time at Google, and I wasn't even aware of the recommendation you're quoting.

The main problem is that you'll need to deserialize the whole protocol buffer into memory at once, so it's worth thinking about whether your data is better off broken up into smaller items so that you only have to have part of the data in memory at once.

If you can't break it up, then no worries. Go ahead and use a massive protocol buffer.

Upvotes: 24

jpa

Reputation: 12176

It is just general guidance, so it doesn't apply to every case. For example, the OpenStreetMap project uses a Protocol Buffers-based file format for its maps, and the files are often 10-100 GB in size. Another example is Google's own TensorFlow, which uses protobuf, and the graphs it stores are often up to 1 GB in size.

However, OpenStreetMap does not store the entire file as a single message. Instead it consists of thousands of individual messages, each encoding a part of the map. You can apply a similar approach, so that each message encodes, for example, only one node.
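A minimal sketch of that approach: a container file holding many small length-prefixed records, so each one can be read and decoded independently. Plain byte strings stand in for serialized messages here; in real protobuf code each payload would come from `msg.SerializeToString()` and be decoded with `msg.ParseFromString()`. The 4-byte little-endian length header is an assumption of this sketch, not part of the protobuf wire format.

```python
import struct

def write_messages(path, payloads):
    """Write each payload prefixed with a 4-byte little-endian length."""
    with open(path, "wb") as f:
        for payload in payloads:
            f.write(struct.pack("<I", len(payload)))
            f.write(payload)

def read_messages(path):
    """Yield payloads one at a time, so only one is in memory at once."""
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if not header:
                return
            (size,) = struct.unpack("<I", header)
            yield f.read(size)

write_messages("graph.bin", [b"node-0", b"node-1", b"node-2"])
print(list(read_messages("graph.bin")))  # [b'node-0', b'node-1', b'node-2']
```

Because `read_messages` is a generator, a 100 GB file can be processed node by node without ever holding more than one message in memory.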

The main problem with protobuf for large files is that it doesn't support random access. You'll have to read the whole file, even if you only want to access a specific item. If your application will be reading the whole file to memory anyway, this is not an issue. This is what TensorFlow does, and it appears to store everything in a single message.
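To make the lack of random access concrete, here is a sketch of what "accessing a specific item" looks like in a file of length-prefixed records (same assumed layout as above: 4-byte little-endian length, then the body). Even with framing, reaching the n-th record means a linear scan over every earlier length header; without framing, a single giant protobuf message must be parsed in full before any field can be read.

```python
import struct

def nth_message(path, n):
    """Return the n-th record; note the linear scan past all earlier ones."""
    with open(path, "rb") as f:
        for _ in range(n):
            (size,) = struct.unpack("<I", f.read(4))
            f.seek(size, 1)  # skip this record's body
        (size,) = struct.unpack("<I", f.read(4))
        return f.read(size)

# Build a small test file in the assumed layout.
with open("records.bin", "wb") as f:
    for payload in (b"alpha", b"beta", b"gamma"):
        f.write(struct.pack("<I", len(payload)))
        f.write(payload)

print(nth_message("records.bin", 2))  # b'gamma'
```

True random access would require a separate index (record id to file offset), which is exactly what formats like HDF5 and sqlite maintain for you.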

If you need a random access format that is compatible across many languages, I would suggest HDF5 or sqlite.
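For instance, a sketch of the sqlite route using Python's standard-library `sqlite3` module: store each serialized node as a BLOB keyed by id, and a single node can then be fetched without reading the rest of the file (the table name and placeholder byte strings are illustrative).

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for real data
conn.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, data BLOB)")
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?)",
    [(i, f"serialized-node-{i}".encode()) for i in range(1000)],
)
conn.commit()

# Random access: only the requested row is read, via the primary-key index.
(blob,) = conn.execute("SELECT data FROM nodes WHERE id = ?", (742,)).fetchone()
print(blob)  # b'serialized-node-742'
```

The same database file is readable from C++ through the sqlite C API, which keeps the cross-language requirement intact.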

Upvotes: 38
