Stewart

Reputation: 3161

Reading a binary file in C++ returns unexpected values

I'm new to C++ and to reading binary files. I'm trying to read a binary file whose first two bytes hold a uint16 value. I have written the following code to read the file.

std::ifstream large;
std::string file_path = "C:\\Users\\stewart\\Desktop\\WIN95\\WIN95\\SC2K\\DATA\\LARGE.DAT";

large.open(file_path, std::ios::binary);
large.seekg(0, std::ios::beg);
short file_entries = 0;
large.read((char*)&file_entries, sizeof(short));

In the above code, file_entries ends up as -2815, which is not what I expect: the value should be 501. I have written a JS version in Node that confirms this. Note that both programs read the same file yet return different values.

function toArrayBuffer(buf) {
    var ab = new ArrayBuffer(buf.length);
    var view = new Uint8Array(ab);
    for (var i = 0; i < buf.length; ++i) {
        view[i] = buf[i];
    }
    return ab;
}

const fs = require('fs');
const path = "C:\\Users\\stewart\\Desktop\\WIN95\\WIN95\\SC2K\\DATA\\LARGE.DAT";
const fileContent = fs.readFileSync(path);
const dataView = new DataView(toArrayBuffer(fileContent));
const spriteFileEntries =  dataView.getInt16(0x00);

Why does the C++ version return the wrong value? What have I missed about how this works in C++?

Upvotes: 0

Views: 624

Answers (1)

Miles Budnek

Reputation: 30494

This is an endianness issue. The data in the file is stored in big-endian order, but your CPU uses little-endian order.

501 is 0x01F5 in hex; -2815 is 0xF501 (assuming 16-bit two's complement). Notice that they are the same two bytes, just in the opposite order.

There are two byte orders commonly used to store or transmit multi-byte values: most-significant-byte-first (big-endian, also called network byte order) and least-significant-byte-first (little-endian). Both are in wide use, so it's important to know which one the file you're reading uses. Most network protocols use big-endian order, while most modern CPUs are little-endian.

JavaScript's DataView.getInt16 assumes the data is big-endian by default (you can pass true as its optional second argument to read little-endian instead), since big-endian is the order commonly used for transmitting data across networks. That makes sense given that JavaScript is commonly embedded in a web browser, where interacting with data sent over a network is a common need. The view automatically converts the bytes to the host's byte order when it produces a number.

C++'s integral types are stored using the native byte order of the CPU your program is compiled for. When you read into a short you're directly writing to the bytes that make up that short. No conversion is done. If the data being read is in the wrong order, the bytes will be interpreted as the wrong number.

Since the data you're reading is in network byte order, you can use the ntohs (network to host short) function to swap the bytes around. It would also be a good idea to use a fixed-width type instead of short, since short isn't guaranteed to be 16 bits wide:

#include <cstdint>
#include <fstream>
#include <string>
// plus Winsock2.h (Windows) or arpa/inet.h (POSIX) for ntohs

std::string file_path = "C:\\Users\\stewart\\Desktop\\WIN95\\WIN95\\SC2K\\DATA\\LARGE.DAT";

std::ifstream large(file_path, std::ios::binary);
uint16_t file_entries = 0;
large.read(reinterpret_cast<char*>(&file_entries), sizeof(file_entries));
file_entries = ntohs(file_entries);  // big-endian (network) -> host order

ntohs can be found in the "Winsock2.h" header on Windows or the "arpa/inet.h" header on Linux and other POSIX-compliant OSes.

Upvotes: 3
