Bruce Adams

Reputation: 5589

protocol buffers ParseFromString does not check end of message

I found an interesting gotcha with protocol buffers. If you have two similar messages it is possible to parse one as if it were the other using the C++ API or the command line.

The limited documentation for ParseFromString does not mention that it need not consume the whole string, or that it will not fail if it doesn't.

I had expected ParseFromString to fail when asked to parse a message of type B as type A. After all, the message contains extra data. However, this is not the case. An example script demonstrates the issue:

#!/bin/sh

cat - >./foobar.proto <<EOF
syntax = "proto3";
package demo;
message A
{
   uint64 foo = 1;
};

enum flagx { 
  y = 0; 
  z = 1; 
}

message B {
   uint64 foolish = 1;
   flagx bar = 2;
};

EOF

cat - >./mess.B.in.txtfmt <<EOF
foolish: 10
bar: y
EOF

cat - >./mess.A.in.txtfmt <<EOF
foo: 10
EOF

protoc --encode=demo.A foobar.proto <./mess.A.in.txtfmt >./mess.A.proto
protoc --encode=demo.B foobar.proto <./mess.B.in.txtfmt >./mess.B.proto
protoc --decode=demo.A foobar.proto >./mess.out.txtfmt <./mess.B.proto

echo "in: "
cat mess.B.in.txtfmt
echo "out: "
cat mess.out.txtfmt

echo "xxd mess.A.proto:"
xxd mess.A.proto

echo "xxd mess.B.proto:"
xxd mess.B.proto

The output is:

in: 
foolish: 10
bar: y
out: 
foo: 10
xxd mess.A.proto:
00000000: 080a                                    
xxd mess.B.proto:
00000000: 080a

So the binary message is identical for both messages A and B.

If you alter the protocol so that instead of an enum you have another varint (uint64) you get distinct binary messages but ParseFromString will still successfully parse the longer message as the shorter one.

To really confuse things it also seems to be able to parse the shorter message as the longer one.

Is this a bug or a feature?

Upvotes: 3

Views: 1753

Answers (1)

Bruce Adams

Reputation: 5589

I think this is by design but the documentation could be better.

This confusion may arise if you try to use the API without first reading up on the wire format. The wire format is not as irrelevant to the API as you might expect.

The wire format emphasises compactness over correctness. If you want to check the correctness of a message you are invited to use other means.
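To make that concrete, the bytes from the question can be decoded by hand. What follows is a minimal Python sketch, not protobuf's own parser: `read_varint`, `parse_as` and `FIELDS_A` are hypothetical names, and it handles only varint fields (wire type 0), which is all the example messages use. The second input is hand-assembled on the assumption that B's second field is a uint64 holding 20.

```python
def read_varint(buf, pos):
    """Decode one base-128 varint at pos; return (value, next_pos)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # a clear high bit marks the final byte
            return result, pos
        shift += 7


def parse_as(known_fields, buf):
    """Split a varint-only message into known fields and unknown ones."""
    known, unknown = {}, []
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        number, wire_type = tag >> 3, tag & 0x07
        if wire_type != 0:
            raise ValueError("this sketch only handles varint fields")
        value, pos = read_varint(buf, pos)
        if number in known_fields:
            known[known_fields[number]] = value
        else:
            unknown.append((number, value))  # skipped, not rejected
    return known, unknown


FIELDS_A = {1: "foo"}  # message A from the question

# The two bytes xxd printed for both files.
print(parse_as(FIELDS_A, bytes.fromhex("080a")))      # ({'foo': 10}, [])
# Hand-assembled: field 1 = 10, then a second varint field 2 = 20.
print(parse_as(FIELDS_A, bytes.fromhex("080a1014")))  # ({'foo': 10}, [(2, 20)])
```

Nothing in the bytes names the message type or the field. The tag 0x08 only says "field 1, varint", so a reader for A accepts it, and a field number it does not know is set aside rather than treated as an error.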

You might (arguably should or must) include in your message one or more of the following:

  • A message type field
  • A message length field
  • A checksum
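One way to carry those three items is an envelope around the serialized bytes. The sketch below is only an illustration in Python: the header layout (`!HII`), the type ids `TYPE_A` and `TYPE_B`, and the helpers `wrap` and `unwrap` are all assumptions of this example, not part of protocol buffers.

```python
import struct
import zlib

# Hypothetical header: 2-byte type id, 4-byte payload length, 4-byte CRC32,
# all in network byte order.
HEADER = struct.Struct("!HII")

TYPE_A, TYPE_B = 1, 2  # hypothetical ids for messages A and B


def wrap(type_id, payload):
    """Prefix a serialized message with its type, length and checksum."""
    return HEADER.pack(type_id, len(payload), zlib.crc32(payload)) + payload


def unwrap(expected_type, frame):
    """Return the payload, or raise ValueError if any check fails."""
    if len(frame) < HEADER.size:
        raise ValueError("frame shorter than header")
    type_id, length, crc = HEADER.unpack_from(frame)
    payload = frame[HEADER.size:]
    if type_id != expected_type:
        raise ValueError("unexpected message type")
    if len(payload) != length:
        raise ValueError("payload length mismatch")
    if zlib.crc32(payload) != crc:
        raise ValueError("checksum mismatch")
    return payload


frame = wrap(TYPE_B, bytes.fromhex("080a"))  # serialized B from the question
payload = unwrap(TYPE_B, frame)              # passes all three checks
```

With this in place, `unwrap(TYPE_A, frame)` raises ValueError instead of handing B's bytes to A's parser, and the payload that does get through can then be passed to ParseFromString.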

The second point, that a shorter message can be parsed as a longer one, follows from the fact that in protocol buffers 3 all fields are optional. Protocol buffers 2 had the concept of a required field, and its removal caused some controversy (see for example "Why required and optional is removed in Protocol Buffers 3" and https://capnproto.org/faq.html#how-do-i-make-a-field-required-like-in-protocol-buffers). A field that has the default value (typically 0) is not included in the encoded message, and field names are replaced by field numbers on the wire. Thus an encoded message of one type can very easily be a valid encoding of a 'different' type as well.
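The effect of omitting defaults can be reproduced with a toy encoder. This is a rough Python sketch under stated assumptions: `encode_varint` and `encode_message` are hypothetical helpers that cover only varint fields and mimic the proto3 rule of leaving out zero-valued fields. It is not the real serializer.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_message(fields):
    """Encode {field_number: value} using varint fields only."""
    out = bytearray()
    for number, value in sorted(fields.items()):
        if value == 0:
            continue  # default value: nothing is written for this field
        out += encode_varint(number << 3)  # tag with wire type 0 (varint)
        out += encode_varint(value)
    return bytes(out)


a = encode_message({1: 10})        # A: foo = 10
b = encode_message({1: 10, 2: 0})  # B: foolish = 10, bar = y (0)
print(a.hex(), b.hex(), a == b)    # 080a 080a True
```

Read in the other direction, a parser for B that receives A's two bytes finds no field 2 and leaves bar at its default, which is why the shorter message also parses as the longer one.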

Upvotes: 3
