nzomkxia
nzomkxia

Reputation: 1267

what's the storage format of protobuf data?

the .proto file:

package lm;
message helloworld
{
     required int32 id = 1;
     required string str = 2;
     optional int32 opt = 3;
}

the writer.cc file:

#include <iostream>
#include <string>
#include "lm.helloworld.pb.h"
#include <fstream>
using namespace std;

int main()
{
    lm::helloworld msg1;
    msg1.set_id(101000);
    msg1.set_str("helloworld,this is a protobuf writer");
    fstream output("log", ios::out | ios::trunc | ios::binary);
    string _data;
    msg1.SerializeToString(&_data);
    cout << _data << endl;
    if(!msg1.SerializeToOstream(&output))
    {
        cerr << "Failed to write msg" << endl;
        return -1;
    }
    return 0;
}

the reader.cc file:

#include <iostream>
#include <fstream>
#include <string>
#include "lm.helloworld.pb.h"
using namespace std;

void ListMsg(const lm::helloworld & msg)
{
    cout << msg.id() << endl;
    cout << msg.str() << endl;
}

int main(int argc, char* argv[])
{
    lm::helloworld msg1;
    {
        fstream input("log", ios::in | ios::binary);
        if (!msg1.ParseFromIstream(&input))
        {
            cerr << "Failed to parse address book." << endl;
            return -1;
        }
    }

    ListMsg(msg1);
    return 0;
}

It's a simple reader and writer model using protobuf. but what's in the log is a readable string a typed in the write.cc file instead of "numeric format",why is that?

Upvotes: 4

Views: 2495

Answers (1)

Marc Gravell
Marc Gravell

Reputation: 1063964

The encoding is described here.

Without an example of what comes out the other end, that is slightly hard to answer precisely, but there are two possible explanations of what you are seeing:

  1. you have explicitly switched into TextFormat in your code; this is very unlikely - and indeed, the primary use of TextFormat is debugging etc
  2. far more likely, you're just seeing the text data from your message in the binary; text is encoded as UTF-8, so if you open a protobuf file in a text editor, pieces of it will appear readable enough to display something in the file

The real question is: what are the actual bytes in your output file? If it is something like:

08-88-95-06-12-24-68-65-6C-6C-6F-77-6F-72-6C-64-2C-74-68-69-73-20-69-73-20-61-20-70-72-6F-74-6F-62-75-66-20-77-72-69-74-65-72

then that is the binary format; but note that most of that is simply the UTF-8 of the string "helloworld,this is a protobuf writer" - which dominates the message by sheer size:

68-65-6C-6C-6F-77-6F-72-6C-64-2C-74-68-69-73-20-69-73-20-61-20-70-72-6F-74-6F-62-75-66-20-77-72-69-74-65-72

So if you look in any text editor, it will appear as a few garbled characters at the start followed by the clearly legible helloworld,this is a protobuf writer.

The "binary" here is the bit at the start:

08-88-95-06-12-24

This is:

  • 08: header: field 1, varint
  • 88-95-06: the value (decimal) 101000 as a varint
  • 12: header: field 2, length-prefixed
  • 24: the value (decimal) 36 as a varint (the length of the string, in bytes)

The key points to note:

  • if your message is dominated by text, the yes, large parts of it will look human-readable even in binary form
  • look at the overheads; it to 6 bytes to encode the entire rest of the message, and 3 bytes of that was data (the 101000) - so only 3 bytes was actually lost as overhead; now compare and contrast to xml, json, etc to understand what protobuf is doing to help you

Upvotes: 3

Related Questions