nick

Reputation: 872

Can pandas read a C++ binary file directly?

I have a large file that is written by my C++ code.

It saves structs to the file in binary format.

For example:

struct A {
  char name[32];
  int age;
  double height;
};

The output code looks like this:

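std::fstream f;  // assumed to be opened for output in binary mode
for (int i = 0; i < 10000000; ++i) {
  A a{};  // fields are filled in here
  f.write(reinterpret_cast<const char*>(&a), sizeof(a));
}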

I want to process it in Python as a pandas DataFrame.

Is there a good method to read it elegantly?

Upvotes: 0

Views: 555

Answers (2)

Pietro

Reputation: 1120

Searching for read_bin, I found this issue, which suggests using np.fromfile to load the data into a numpy array and then converting that to a dataframe:

import numpy as np
import pandas as pd

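dt = np.dtype(
    [
        ("name", "S32"),   # 32-byte, NUL-padded byte string
        ("age", "i4"),     # 32-bit signed integer
        ("height", "f8"),  # 64-bit floating-point number
    ],
    # C compilers usually pad the struct so the double is 8-byte aligned
    # (sizeof(A) is typically 48, not 44); align=True mimics that padding.
    align=True,
)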

records = np.fromfile("filename.bin", dt)
df = pd.DataFrame(records)

Please note that I have not tested this code, so there could be some problems in the data types I picked:

  • the byte order might differ from your machine's; it can be set explicitly with a big-endian ('>') or little-endian ('<') prefix, e.g. dt = np.dtype([('big', '>i4'), ('little', '<i4')])
  • the char array is read as a NUL-padded byte string, which I think will end up as a bytes object in Python, so you might want to convert it to a string (using df['name'] = df['name'].str.decode('utf-8'))

More info on the data types can be found in the numpy docs.
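
Since the record size has to match sizeof(A) exactly, it may be worth checking the layout of the dtype before reading the real file. The following is only a sketch: the helper name layout is made up for illustration, and the aligned result depends on the platform (48 bytes is what I would expect on common 64-bit systems, but that is an assumption).

```python
import numpy as np

fields = [("name", "S32"), ("age", "i4"), ("height", "f8")]

packed = np.dtype(fields)               # fields placed back to back
aligned = np.dtype(fields, align=True)  # padding as a C compiler would add


def layout(dt):
    """Return (itemsize, {field name: byte offset}) for a structured dtype."""
    return dt.itemsize, {name: dt.fields[name][1] for name in dt.names}


packed_size, packed_offsets = layout(packed)
aligned_size, aligned_offsets = layout(aligned)

# Compare these numbers with sizeof(A) and offsetof(A, ...) on the C++ side.
print("packed: ", packed_size, packed_offsets)
print("aligned:", aligned_size, aligned_offsets)
```

Whichever itemsize equals sizeof(A) in the C++ program is the dtype to pass to np.fromfile; if the file size is not a multiple of it, the layout is probably wrong.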

Cheers!

Upvotes: 3

tripleee

Reputation: 189597

Untested, based on a quick review of the Python struct module's documentation.

import struct

def reader(filehandle):
    """
    Accept an open filehandle; read and yield tuples according to the
    specified format (see the source) until the filehandle is exhausted.
    """
    mystruct = struct.Struct("32sid")
    while True:
        buf = filehandle.read(mystruct.size)
        if len(buf) == 0:
            break
        name, age, height = mystruct.unpack(buf)
        yield name, age, height

Usage:

with open(filename, 'rb') as data:
    for name, age, height in reader(data):
        # do things with those values
        print(name, age, height)
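
To try the approach without the real file, here is a small self-contained sketch that packs two made-up records into an in-memory buffer and reads them back. The names RECORD, read_records and fake_file are hypothetical, the sample values are invented, and treating the name as UTF-8 is an assumption about what the C++ side writes.

```python
import io
import struct

# Same native-mode format as above: char[32], int, double.
RECORD = struct.Struct("32sid")


def read_records(filehandle):
    """Yield (name, age, height) tuples until no full record is left."""
    while True:
        buf = filehandle.read(RECORD.size)
        if len(buf) < RECORD.size:
            break  # end of file, or a truncated trailing record
        name, age, height = RECORD.unpack(buf)
        # "32s" returns all 32 bytes; keep only the part before the first NUL.
        yield name.split(b"\0", 1)[0].decode("utf-8"), age, height


# Stand-in for the binary file: two records packed back to back.
fake_file = io.BytesIO(
    RECORD.pack(b"alice", 30, 1.65) + RECORD.pack(b"bob", 41, 1.80)
)
rows = list(read_records(fake_file))
print(rows)
```

A list of tuples like rows can be handed to pd.DataFrame together with columns=["name", "age", "height"], although for ten million records a pure-Python loop like this will be slow.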

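I don't know enough about C++ endianness conventions to say whether you need to add a modifier to swap the byte order. My guess is that if the C++ program and Python both run on the same machine, you don't have to worry about it. Note that the default (native) mode used here also applies the C compiler's alignment padding, whereas an explicit byte-order prefix such as < or > turns that padding off.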
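
One thing to watch if you do add a byte-order modifier: in the struct module a prefix such as < or > also switches to standard sizes with no alignment padding, so the record size can change. A minimal sketch of the difference follows; the variable names are only illustrative, and the native size depends on the platform.

```python
import struct

native = struct.Struct("32sid")         # native byte order, sizes and padding
standard_le = struct.Struct("<32sid")   # little-endian, no padding at all
padded_le = struct.Struct("<32si4xd")   # little-endian, 4 pad bytes written out

native_size = native.size
standard_size = standard_le.size
padded_size = padded_le.size

# The format whose size equals sizeof(A) is the one that matches the file.
print(native_size, standard_size, padded_size)
```

If the file was written on a machine with a different byte order, the third form is the kind of format string to use, with the pad bytes adjusted to match the writer's struct layout.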

Upvotes: 1
