almagest

Reputation: 161

Encoding, datatypes & packed repeated fields

I have some questions regarding packed fields and storing/serializing data with Protocol Buffers. Essentially, I want to store 4 MB of data to a file.

The data I have (in our embedded system) is received as uint8_t (a byte), and I want to store it as efficiently as possible.

I have been testing four protobuf setups, built from these two declarations:

repeated uint32_t datastruct = 1;
repeated uint32_t datastruct = 1 [packed = true];

Each declaration was tested two ways: assigned 1-to-1 (one uint8 per uint32), and bit-shifted with 4 values crammed into each uint32_t.

To my surprise, the stored files are much larger than the original data. (That was expected, of course, for the setups where I put one uint8 into each uint32.) The best result I could achieve was 5.2 MB for the 4 MB of data, which really isn't that good.

Have I misunderstood something vital here? I do realize that protobuf adds information to the packets, but a 25% increase is too much, in my opinion.

Also, using GzipOutputStream increases the size of the file instead of decreasing it.

Any tips would be very appreciated!

Thanks for your time.

Upvotes: 1

Views: 2942

Answers (2)

Marc Gravell

Reputation: 1062905

This answer is based on the assumption that you are using uint32 in .proto terms:

packed is a positive thing here (it removes the per-value headers); however, by putting a single uint8 into a uint32, you are running into a facet of "varint" encoding: if the most significant bit of the byte is set, the value takes 2 bytes (varint uses 7 bits per byte for data, and one bit as a continuation flag). Consequently, I would recommend switching to the bytes type, which represents any arbitrary chunk of bytes and is encoded "as is", without any varint or similar applied to the payload. It wouldn't be repeated/packed - just:

[required|optional] bytes data = 1;
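To make the size difference concrete, here is a minimal sketch in plain Python that computes encoded field sizes by hand from the wire-format rules described above. It does not use the protobuf library; the helper names (varint_len, packed_uint32_size, bytes_field_size) are made up for illustration, and a field number below 16 is assumed so that the tag fits in one byte.

```python
def varint_len(value: int) -> int:
    """Number of bytes a non-negative integer occupies as a varint."""
    n = 1
    while value >= 0x80:  # more than 7 bits left, so another byte is needed
        value >>= 7
        n += 1
    return n


def packed_uint32_size(values) -> int:
    """Size of a packed repeated uint32 field: tag + length prefix + varints."""
    payload = sum(varint_len(v) for v in values)
    return 1 + varint_len(payload) + payload


def bytes_field_size(data: bytes) -> int:
    """Size of a bytes field: tag + length prefix + raw payload."""
    return 1 + varint_len(len(data)) + len(data)


# 1024 input bytes, half of which have the most significant bit set.
data = bytes(range(256)) * 4

print(packed_uint32_size(data))  # 1539: 512 one-byte + 512 two-byte varints, plus 3
print(bytes_field_size(data))    # 1027: the raw 1024 bytes, plus 3
```

For input with this byte distribution, one uint8 per packed uint32 costs about 50% extra, while the bytes field adds only the tag and length prefix.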

Another option would be to use fixed32 (repeated and packed), and place (via shifting) 4 bytes per value, but by the time you've done that you may as well go to bytes and have a more obvious 1:1 map.
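As a rough illustration of why the fixed32 route ends up equivalent to bytes, the sketch below groups input bytes into 32-bit words and then lays them out the way fixed32 values appear on the wire (4 bytes each, little-endian). The function names are hypothetical, and zero-padding the tail is an assumption of this sketch; a real schema would need to carry the original length separately.

```python
import struct


def pack_fixed32_words(data: bytes) -> list:
    """Group bytes into little-endian 32-bit words, zero-padding the tail."""
    padded = data + b"\x00" * (-len(data) % 4)
    return [word for (word,) in struct.iter_unpack("<I", padded)]


def fixed32_payload(words) -> bytes:
    """Payload of a packed repeated fixed32 field: 4 little-endian bytes per word."""
    return struct.pack("<%dI" % len(words), *words)


data = bytes([0x12, 0x34, 0x56, 0x78, 0x9A])
words = pack_fixed32_words(data)
payload = fixed32_payload(words)

print([hex(w) for w in words])  # ['0x78563412', '0x9a']
print(payload == data + b"\x00\x00\x00")  # True
```

With little-endian grouping, the payload is just the original bytes plus padding, which is the same thing a bytes field stores without the shifting step.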

Re gzip; it is not uncommon for gzip to increase the size of arbitrary binary without many repeated blocks. By contrast, if your protobuf document contains strings it is common for the size to shrink, as gzip can spot repeated blocks.
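The gzip effect is easy to reproduce outside protobuf. The following sketch uses Python's standard gzip module rather than GzipOutputStream, and the two sample inputs are invented for the demonstration: random bytes stand in for binary data with no repeated blocks, and a repeated string stands in for text-heavy content.

```python
import gzip
import os

# Binary data with no repeated blocks (random bytes as a stand-in).
random_blob = os.urandom(64 * 1024)

# Highly repetitive content, similar in spirit to repeated strings.
repetitive = b"sensor_reading=0;" * 4096

random_gz = gzip.compress(random_blob)
repetitive_gz = gzip.compress(repetitive)

# Random input cannot be compressed, so the gzip header, trailer and
# block framing make the output slightly larger than the input.
print(len(random_gz) > len(random_blob))

# Repetitive input shrinks considerably.
print(len(repetitive_gz) < len(repetitive))
```

Whether real sensor data behaves more like the first or the second case depends on how much structure it contains, so it is worth measuring on a representative sample.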

Upvotes: 2

Mike Seymour

Reputation: 254471

If you want to store a sequence of bytes, use the bytes datatype. The overhead will be tiny.

The extra overhead you're seeing from uint32 comes from the variable-length coding: numbers are stored with 7 bits in each byte, and the 8th bit indicates whether more bytes follow. So a full-scale 32-bit value needs 5 bytes to store. There is a fixed32 type, which always takes 32 bits (4 bytes) per value - that will be more efficient if most values need 32 bits to represent, and less efficient if most values are small.
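The 7-bits-per-byte rule can be checked with a few lines of Python. This is a sketch computed by hand rather than with the protobuf library; varint_len and expected_varint_len_uniform32 are illustrative names, and the assumption that 32-bit words are uniformly distributed is mine, not something stated in the question.

```python
def varint_len(value: int) -> int:
    """Number of bytes a non-negative integer occupies as a varint."""
    n = 1
    while value >= 0x80:
        value >>= 7
        n += 1
    return n


def expected_varint_len_uniform32() -> float:
    """Average varint length over all 2**32 values, assuming a uniform spread."""
    total = 0
    lower = 0
    for n in range(1, 6):
        # An n-byte varint holds values below 2**(7*n), capped at 2**32.
        upper = min(1 << (7 * n), 1 << 32)
        total += n * (upper - lower)
        lower = upper
    return total / (1 << 32)


print(varint_len(0x7F))        # 1
print(varint_len(0x80))        # 2
print(varint_len(0x0FFFFFFF))  # 4
print(varint_len(0x10000000))  # 5
print(varint_len(0xFFFFFFFF))  # 5

average = expected_varint_len_uniform32()
print(4.9 < average < 5.0)     # True: just under 5 bytes per 4 bytes of input
```

Under that uniformity assumption, cramming 4 bytes into each varint-coded uint32 costs a little under 5 bytes per word, an overhead in the same range as the roughly 25% increase reported in the question, whereas fixed32 stays at exactly 4.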

Upvotes: 1
