user3641187

Reputation: 415

Cannot correctly parse a binary using C++ ifstream

I am trying to parse a NASDAQ ITCH protocol data dump using C++. The files are large; for anyone interested, they are available here:

ftp://emi.nasdaq.com/ITCH/

The spec for these files boils down to this:

  1. A two byte big-endian length that indicates the length of the rest of the packet
  2. A single byte ASCII header indicating type
  3. The variable length payload (size: length-1)
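As an illustration of the layout above, here is a minimal C++ sketch that decodes the three header bytes from a buffer that already holds decompressed ITCH data. The names `MessageHeader`, `decode_header` and `payload_size` are hypothetical helpers made up for this example; they do not come from the ITCH spec or any library. The length is assembled byte by byte, so the result does not depend on the host's endianness.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper type for this sketch (not part of the ITCH spec).
struct MessageHeader {
    std::uint16_t length;  // bytes following the length field: type byte + payload
    char type;             // single ASCII byte identifying the message type
};

// Decode the 3-byte header (2-byte big-endian length, 1-byte type)
// from a buffer of already decompressed data.
inline MessageHeader decode_header(const unsigned char* buf, std::size_t size) {
    assert(buf != nullptr && size >= 3);
    MessageHeader h;
    // High byte first, then low byte: big-endian, independent of the host CPU.
    h.length = static_cast<std::uint16_t>((buf[0] << 8) | buf[1]);
    h.type = static_cast<char>(buf[2]);
    return h;
}

// Payload size is length - 1, because the type byte is counted in the length.
inline std::size_t payload_size(const MessageHeader& h) {
    return h.length > 0 ? static_cast<std::size_t>(h.length) - 1u : 0u;
}
```

For the bytes `00 0C 53`, this would give a length of 12, the type `S`, and a payload of 11 bytes.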

To be sure the downloaded file contents are OK, I did a quick check in Python, using Python's buffered gzip reader. The contents parse as expected:

import gzip

bin_data = gzip.open('01302020.NASDAQ_ITCH50.gz', 'rb')
message_size_bytes = bin_data.read(2)
message_size = int.from_bytes(message_size_bytes, byteorder='big', signed=False)
message_type = bin_data.read(1).decode('ascii')
record = bin_data.read(message_size - 1)
print("size: " + str(message_size) + " type: " + message_type)
# >>> size: 12 type: S

message_size in this particular case prints 12, which is the correct value. The following character S is also correct.

However, my own attempts at replicating the correct Python parsing behaviour using std::ifstream all fail. I can't even read the first 2 bytes (which should give a remaining payload size of 12) correctly. Here are my attempts, some of which are, at this point, just shots in the dark:

#include <arpa/inet.h>  // ntohl, ntohs
#include <cstdint>
#include <fstream>
#include <iostream>
#include <string>

int main() {
std::string filepath = "/Users/estebanlanter/Documents/Finance/HFT/01302020.NASDAQ_ITCH50.gz";
std::ifstream ifs;
ifs.open(filepath, std::ifstream::in);
std::cout<<"open...."<<std::endl;

// trial A
ifs.clear();
ifs.seekg(0);
int size_a;
ifs.read(reinterpret_cast<char*>(&size_a), 2);
std::cout<<"size: "<<size_a<<std::endl;
// size: 325683999   

// trial B
ifs.clear();
ifs.seekg(0);
int size_b;
ifs.read(reinterpret_cast<char*>(&size_b), 2);
size_b = ntohl(size_b);
std::cout<<"size: "<<size_b<<std::endl;
// size: 529203200   

// trial C
ifs.clear();
ifs.seekg(0);
int size_c;
ifs.read(reinterpret_cast<char*>(&size_c), 2);
size_c = ntohs(size_c);
std::cout<<"size: "<<size_c<<std::endl;
// size: 8075   

// trial D
ifs.clear();
ifs.seekg(0);
uint8_t size_d;
ifs.read(reinterpret_cast<char*>(&size_d), 2);
std::cout<<"size: "<<size_d<<std::endl;
// size:   

// trial E
ifs.clear();
ifs.seekg(0);
uint8_t size_e;
ifs.read(reinterpret_cast<char*>(&size_e), 2);
size_e = ntohl(size_e);
std::cout<<"size: "<<size_e<<std::endl;
// size: 


// trial F
ifs.clear();
ifs.seekg(0);
uint8_t size_f;
ifs.read(reinterpret_cast<char*>(&size_f), 2);
size_f = ntohs(size_f);
std::cout<<"size: "<<size_f<<std::endl;
// size: 



// trial G
ifs.clear();
ifs.seekg(0);
char size_g;
ifs.read(&size_g, 2);
std::cout<<"size: "<<size_g<<std::endl;
// size: 

// trial H
ifs.clear();
ifs.seekg(0);
char size_h;
ifs.read(&size_h, 2);
size_h = ntohl(size_h);
std::cout<<"size: "<<size_h<<std::endl;
// size: 

// trial I
ifs.clear();
ifs.seekg(0);
char size_i;
ifs.read(&size_i, 2);
size_i = ntohs(size_i);
std::cout<<"size: "<<size_i<<std::endl;
// size: 
}

What am I doing wrong? How do I parse the first 2 bytes as an integer and the next byte as a character? It seems so simple using Python...

By the way, I am on a little-endian macOS machine; the data in the gzip file is big-endian.

EDIT

As some correctly pointed out, int is the wrong type to hold 2 bytes. I've also replaced the ifstream open flag with std::ios::binary. Unfortunately, it still does not print the correct value...

std::string filepath = "/Users/estebanlanter/Documents/Finance/HFT/01302020.NASDAQ_ITCH50.gz";
std::ifstream ifs;
ifs.open(filepath, std::ios::binary);


std::cout<<"open...."<<std::endl;


ifs.clear();
ifs.seekg(0);
unsigned short size_a;
ifs.read(reinterpret_cast<char*>(&size_a), 2);
std::cout<<"size: "<<size_a<<std::endl;
// size: 35615    


ifs.clear();
ifs.seekg(0);
short size_b;
ifs.read(reinterpret_cast<char*>(&size_b), 2);
std::cout<<"size: "<<size_b<<std::endl;
// size: -29921

EDIT 2

User Casey pointed out that it's good practice to use guaranteed-size types. Since I know that the size consists of 2 bytes (and is always positive), I declare the size as uint16_t. However, I still have no luck arriving at the number 12 returned by the Python parser...

int main() {
std::string filepath = "/Users/estebanlanter/Documents/Finance/HFT/01302020.NASDAQ_ITCH50.gz";
std::ifstream ifs;
ifs.open(filepath, std::ios::binary);


std::cout<<"open...."<<std::endl;


ifs.clear();
ifs.seekg(0);
uint16_t size_a;
ifs.read(reinterpret_cast<char*>(&size_a), 2);
std::cout<<"size: "<<size_a<<std::endl;
// size: 35615
}

Upvotes: 0

Views: 406

Answers (2)

tensored

Reputation: 1

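Try decompressing the file first. A .gz file contains gzip-compressed data, so the raw bytes that std::ifstream reads are not the ITCH messages themselves. Once the file is decompressed, the same parsing code should work.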
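The numbers printed in the question are consistent with this: a gzip stream starts with the magic bytes 0x1f 0x8b, and 35615 is 0x8B1F, which is those two bytes read as a little-endian 16-bit integer (8075 from trial C is 0x1F8B, the same bytes after ntohs). A small sketch of a check for that signature follows. The function names `looks_gzipped` and `looks_gzipped_bytes` are made up for this example, and the check assumes a seekable stream.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Returns true if the stream starts with the gzip magic bytes 0x1f 0x8b.
// The read position is restored afterwards, which assumes a seekable stream
// (a file or string stream; not a pipe).
inline bool looks_gzipped(std::istream& in) {
    const std::istream::pos_type start = in.tellg();
    char magic[2] = {0, 0};
    in.read(magic, 2);
    const bool got_two = (in.gcount() == 2);
    in.clear();       // reset eof/fail bits in case the stream was too short
    in.seekg(start);  // rewind so the caller can still parse from the start
    return got_two &&
           static_cast<unsigned char>(magic[0]) == 0x1f &&
           static_cast<unsigned char>(magic[1]) == 0x8b;
}

// Convenience wrapper (hypothetical) for checking an in-memory byte string.
inline bool looks_gzipped_bytes(const std::string& bytes) {
    std::istringstream in(bytes);
    return looks_gzipped(in);
}
```

Calling `looks_gzipped(ifs)` right after opening the file in binary mode would flag the .gz file before any attempt to interpret its bytes as ITCH messages.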

Upvotes: 0

user3641187

Reputation: 415

@Retired Ninja correctly pointed out that the file is gzipped, which was the source of the error. I had glossed over the fact that in Python I was calling a gzip-specific I/O reader. After decompressing the file, I can correctly parse the first bytes.
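For completeness, here is a rough sketch of how the read loop might look once the file has been decompressed. It is not the code I actually ran; `Message` and `read_message` are names invented for this example, and it assumes the stream contains only back-to-back messages in the length/type/payload layout described in the question.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>  // std::istringstream is handy for trying this without a file
#include <string>

// Hypothetical container for one parsed message.
struct Message {
    char type;            // single ASCII type byte
    std::string payload;  // remaining length - 1 bytes, left undecoded
};

// Reads one message from a stream of decompressed data.
// Returns false at end of input or if the message is truncated.
inline bool read_message(std::istream& in, Message& out) {
    unsigned char len_bytes[2];
    if (!in.read(reinterpret_cast<char*>(len_bytes), 2)) return false;

    // Assemble the big-endian length byte by byte (no ntohs needed).
    const std::uint16_t length =
        static_cast<std::uint16_t>((len_bytes[0] << 8) | len_bytes[1]);
    if (length == 0) return false;  // must at least contain the type byte

    char type;
    if (!in.get(type)) return false;
    out.type = type;

    // The payload is whatever remains after the type byte.
    out.payload.resize(length - 1u);
    if (length > 1 && !in.read(&out.payload[0], length - 1)) return false;
    return true;
}
```

With the decompressed file this could be driven by opening `std::ifstream ifs(path, std::ios::binary)` and calling `read_message(ifs, m)` in a `while` loop until it returns false.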

Upvotes: 1
