user878242
user878242

Reputation:

Why is varint an efficient data representation?

I am currently studying the documentation of protocol buffers. Varints are described as:

Each byte in a varint, except the last byte, has the most significant bit (msb) set – this indicates that there are further bytes to come. The lower 7 bits of each byte are used to store the two's complement representation of the number in groups of 7 bits, least significant group first.

My question is why one would choose a representation that looses one bit on every byte? What are the benefits from this approach?

Upvotes: 26

Views: 14465

Answers (4)

chrisn
chrisn

Reputation: 778

It's a tradeoff that depends on the typical numbers you want to send.

  • If they're typically small, use a "varint" encoding (int32, int64, uint32, uint64, sint32, sint64, bool, enum)
  • If they're typically large, use non-"varint" encoding (fixed32, sfixed32, float, fixed64, sfixed64, double)

Here is a comparison of bytes used of sint64, int64 and fixed64 for various numbers:

[ RUN      ] ProtobufOutputStream.printNumberOfBytesOnWireForInts
                        number  sint64   int64 fixed64
                             0       2       2       9
                             1       2       2       9
                            -1       2      11       9
                            63       2       2       9
                           -63       2      11       9
                            64       3       2       9
                           -64       2      11       9
                         10000       4       3       9
                        -10000       4      11       9
           9223372036854775807      11      10       9
          -9223372036854775808      11      11       9

Upvotes: 9

Kenton Varda
Kenton Varda

Reputation: 45326

In practice, the vast majority of integer values are small. Even in cases where you expect that a value will sometimes be very large, and thus you make it 32-bit or even 64-bit, chances are it will usually be small, because statistically speaking most physical quantities follow power-law distributions. So if the small values can be stored in fewer bytes, it's OK if the large values take an extra byte.

About the only kinds of integers that don't benefit are things like hashes or randomly-generated ID numbers which don't actually represent a quantity, but just a bit string. For these, you should use Protobufs' fixed32 or fixed64 types.

Note that varint encoding saves space on the wire but is actually relatively slow, because it requires a lot of branches to encode/decode. It's not as slow as text encoding, of course, but as binary formats go it's not so great. This is one reason why Cap'n Proto decided to reverse this decision and just put fixed-width ints on the wire. Cap'n Proto also includes an optional "packing" algorithm which compresses away zero-valued bytes, which produces similar message sizes to Protobuf but is generally faster because the algorithm uses less branching.

(Disclosure: I am the author of Cap'n Proto, and also of most of the Protobuf code released by Google.)

Upvotes: 42

nos
nos

Reputation: 229334

It's to save space/bandwidth e.g. many programming languages and protocols have a fixed sized data types. e.g. an uint8_t, an uint16_t, uint32_t etc. and these takes up a fixed amount of bytes regardless of how big the value is. e.g. if you store the value 2 in an uint32_t, that takes up 4 bytes.

With an encoding such as varint used in protobuf, small values can take up a smaller space, and the value 2 only needs 1 byte of space to be transferred on the wire, while still being flexible enough to not restrict the range of the value that can be used.

This is often a net win if small values are more common than big values - which is often the case.

Upvotes: 22

Logicrat
Logicrat

Reputation: 4468

Using this approach you can save quite a bit of memory space and/or transmission time if you have a lot of small numbers, while still being able to represent arbitrarily large numbers. You don't have to , for example, allocate 8 bytes for each number. Most of those bytes would be zeros anyway if you have many small numbers. And with 8-byte numbers, you are limited to (2^64 - 1) in value, and would need to do something special if you had a value of 2^64 or greater.

With varint encoding you would normally save a lot of memory and gain the ability to represent numbers of any magnitude.

Upvotes: 3

Related Questions