alfa

Reputation: 3088

Bit reading puzzle (reading a binary file in C++)

I am trying to read the file 'train-images-idx3-ubyte', which can be found here along with the corresponding file format description (at the bottom of the webpage). When I look at the bytes with od -t x1 train-images-idx3-ubyte | less (hexadecimal, bytewise), I get the following output:

address                   bytes
0000000 00 00 08 03 00 00 ea 60 00 00 00 1c 00 00 00 1c
0000020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
...

This is what I expected according to the file format description. But when I try to read the data in C++, I run into a problem. This is what I do:

std::fstream trainingData("minst/train-images-idx3-ubyte",
    std::ios::in | std::ios::binary);
int8_t zero = 0, encoding = 0, dimension = 0;
int32_t samples = -1;
trainingData >> zero >> zero >> encoding >> dimension;
trainingData >> samples;
debugLogger << "training set image file, encoding = "
    << (int) encoding << ", dimension = "
    << (int) dimension << ", items = " << (int) samples << "\n";

But the output of these few lines of code is:

training set image file, encoding = 8, dimension = 3, items = 0

Everything but the number of instances (items, samples) is correct. I tried reading the next 4 bytes as int8_t and that gave me at least the same result as od. I cannot imagine how samples can be 0. What I actually wanted to read here was 10,000. Maybe you've got a clue?

Upvotes: 2

Views: 737

Answers (2)

je4d

Reputation: 7838

As mentioned in other answers, you need to use unformatted input, i.e. istream::read(...) instead of operator>>. Translating your code above to use read yields:

trainingData.read(reinterpret_cast<char*>(&zero), sizeof(zero));
trainingData.read(reinterpret_cast<char*>(&zero), sizeof(zero));
trainingData.read(reinterpret_cast<char*>(&encoding), sizeof(encoding));
trainingData.read(reinterpret_cast<char*>(&dimension), sizeof(dimension));
trainingData.read(reinterpret_cast<char*>(&samples), sizeof(samples));

This gets you most of the way there, but 00 00 ea 60 looks like it's in big-endian format, so if you're running on a little-endian machine (such as an Intel-based one) you'll have to pass it through ntohl (declared in <arpa/inet.h> on POSIX systems) to make sense of it:

samples = ntohl(samples);

which gives encoding = 8, dimension = 3, items = 60000.
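
If ntohl isn't available on your platform, one alternative is to assemble the value from the individual bytes, which works regardless of the host's byte order. This is only a minimal sketch, assuming the field is an unsigned 32-bit big-endian integer; readBigEndianU32 and sampleCountFromDump are hypothetical helper names, not part of any library:

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: reads 4 raw bytes and assembles them as a
// big-endian unsigned 32-bit value, independent of host byte order.
// Returns false if fewer than 4 bytes could be read.
bool readBigEndianU32(std::istream& in, std::uint32_t& value) {
    unsigned char bytes[4];
    if (!in.read(reinterpret_cast<char*>(bytes), sizeof(bytes))) {
        return false;
    }
    value = (std::uint32_t{bytes[0]} << 24) |
            (std::uint32_t{bytes[1]} << 16) |
            (std::uint32_t{bytes[2]} << 8)  |
             std::uint32_t{bytes[3]};
    return true;
}

// Hypothetical self-check: feeds the first 8 bytes of the od dump from
// the question through an in-memory stream and returns the item count.
std::uint32_t sampleCountFromDump() {
    const std::string header("\x00\x00\x08\x03\x00\x00\xea\x60", 8);
    std::istringstream in(header);
    std::uint32_t magic = 0, samples = 0;
    const bool ok = readBigEndianU32(in, magic) &&
                    readBigEndianU32(in, samples);
    assert(ok);
    (void)ok;  // silences the unused-variable warning when NDEBUG is set
    return samples;
}
```

With the real file you would pass trainingData instead of the in-memory stream. Checking the return value also catches a truncated file or a stream that failed to open.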

Upvotes: 2

fdh

Reputation: 5344

operator>> performs formatted input: it interprets the bytes as text, so it gives wrong results for a binary file. Reading with unformatted input (for example istream::read) gives the correct results.
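
To see the difference, here is a minimal sketch that runs a formatted read of an int32_t over raw bytes held in memory; formattedReadFails is a hypothetical helper name used only for illustration:

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical helper: returns true if formatted extraction of an
// int32_t from the given bytes fails (i.e. the stream's failbit is set).
bool formattedReadFails(const std::string& raw) {
    std::istringstream in(raw);
    std::int32_t value = -1;
    in >> value;  // skips whitespace, then tries to parse decimal text
    return in.fail();
}
```

Called with the four bytes 00 00 ea 60 from the question, the extraction fails because the first byte is not a digit or a sign. Since C++11 a failed extraction also stores 0 in the target, which would explain items = 0. Called with the text "60000", it succeeds.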

Upvotes: 2
