Matroc

Reputation: 23

Reading multiple delimited protobuf messages from a file on Windows

I'm writing a tool for my master's thesis that needs to read protobuf data streams from a file. Until now I worked exclusively on Mac OS and everything was fine, but now I'm trying to run the tool on Windows too.

Sadly, on Windows I am not able to read multiple consecutive messages from a single stream. I tried to narrow the problem down and came up with the following small program that reproduces it.

#include "tokens.pb.h"
#include <google/protobuf/io/coded_stream.h>
#include <google/protobuf/io/zero_copy_stream_impl.h>
#include <fstream>

int main(int argc, char* argv[])
{
  std::fstream tokenFile(argv[1], std::ios_base::in);
  if(!tokenFile.is_open())
    return -1;
  google::protobuf::io::IstreamInputStream iis(&tokenFile);
  google::protobuf::io::CodedInputStream cis(&iis);

  while(true){
    google::protobuf::io::CodedInputStream::Limit l;
    unsigned int msgSize;
    if(!cis.ReadVarint32(&msgSize))
      return 0; // probably reached eof
    l = cis.PushLimit(msgSize);

    tokenio::Union msg;
    if(!msg.ParseFromCodedStream(&cis))
      return -2; // couldn't read msg

    if(cis.BytesUntilLimit() > 0)
      return -3; // msg was not read completely

    cis.PopLimit(l);

    if(!msg.has_string() &&
       !msg.has_file() &&
       !msg.has_token() &&
       !msg.has_type())
      return -4; // msg contains no data
  }
  return 0;
}

On Mac OS this runs fine and returns 0 after reading the whole file as I expected.

On Windows the first message is read without problems. For the second message, ParseFromCodedStream still returns true but does not read any data. This leaves BytesUntilLimit larger than 0, so the program returns -3. Of course the message also does not contain any usable data. Any further reads from cis also fail, as if the end of the stream had been reached, even though the file has not been read completely.

I also tried using a FileInputStream with a file descriptor for input with the same result. Removing Push/PopLimit and reading data using ReadString calls with explicit message sizes and then parsing from that string also didn't help.

The following protobuf file was used.

package tokenio;

message TokenType {
    required uint32 id   = 1;
    required string name = 2;
}

message StringInstance {
    required string value = 1;
    optional uint64 id    = 2;
}

message BeginOfFile {
    required uint64 name = 1;
    optional uint64 type = 2;
}

message Token {
    required uint32 type   = 1;
    required uint32 offset = 2;
    optional uint32 line   = 3;
    optional uint32 column = 4;
    optional uint64 value  = 5;
}

message Union {
    optional TokenType      type   = 1;
    optional StringInstance string = 2;
    optional BeginOfFile    file   = 3;
    optional Token          token  = 4;
}

And this is a sample input file.

The input file seems to be OK. At least it's readable by the protobuf editor (on Windows and Mac OS) as well as by the C++ implementation on Mac OS.

The code was tested:

What am I doing wrong?

Upvotes: 2

Views: 2341

Answers (1)

Igor Tandetnik

Reputation: 52621

Make it std::fstream tokenFile(argv[1], std::ios_base::in | std::ios_base::binary);. The default is text mode; on Mac and other Unix-like systems it doesn't matter, but on Windows in text mode you get CRLF sequences translated to LF, and the ^Z character (aka '\x1A') treated as an end-of-file indicator. Those bytes can, by coincidence, occur in a binary stream and cause trouble; a stray ^Z would explain why the stream appears to end after the first message.
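To illustrate the point without depending on the generated tokens.pb.h, here is a minimal sketch of a length-prefixed reader that opens the file in binary mode. The helper names readVarint32 and readDelimitedRecords are made up for this example and are not part of the protobuf API; the hand-rolled varint decoding only stands in for CodedInputStream::ReadVarint32, and the payloads are returned as raw strings instead of being parsed into messages.

```cpp
#include <cstdint>
#include <fstream>
#include <istream>
#include <string>
#include <vector>

// Hypothetical helper: decodes one base-128 varint, the encoding used
// for the length prefix. Returns false on EOF or on an over-long varint.
bool readVarint32(std::istream& in, std::uint32_t& value) {
  value = 0;
  for (int shift = 0; shift < 35; shift += 7) {
    const int c = in.get();
    if (c == std::char_traits<char>::eof())
      return false;
    value |= static_cast<std::uint32_t>(c & 0x7F) << shift;
    if ((c & 0x80) == 0)
      return true;  // high bit clear: this was the last byte
  }
  return false;  // more than five bytes: not a valid 32-bit varint
}

// Hypothetical helper: reads every length-prefixed record from the file.
// std::ios_base::binary is the important part: it disables CRLF
// translation and ^Z end-of-file handling on Windows.
std::vector<std::string> readDelimitedRecords(const std::string& path) {
  std::ifstream in(path, std::ios_base::in | std::ios_base::binary);
  std::vector<std::string> records;
  std::uint32_t size = 0;
  while (readVarint32(in, size)) {
    std::string payload(size, '\0');
    if (size > 0 && !in.read(&payload[0], size))
      break;  // truncated record: stop and keep what was read so far
    records.push_back(payload);
  }
  return records;
}
```

Each returned payload could then be handed to msg.ParseFromString(...). With the flag in place, bytes such as 0x1A or 0x0D 0x0A inside a payload should come back unchanged on Windows as well.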

Upvotes: 2
