nick

Reputation: 872

How can I export a pandas DataFrame to a binary file that C++ can read?

I have a pandas DataFrame whose columns have different data types.

I want to use it from C++, and for performance reasons I want C++ to read it in binary format.

For example:

In [4]: df = pd.DataFrame(np.reshape(range(9), (3, 3)))

In [5]: df
Out[5]: 
   0  1  2
0  0  1  2
1  3  4  5
2  6  7  8

In [6]: df['ticker'] = 'helloworld'

In [7]: df['float'] = 1.12

In [8]: df
Out[8]: 
   0  1  2      ticker  float
0  0  1  2  helloworld   1.12
1  3  4  5  helloworld   1.12
2  6  7  8  helloworld   1.12

I tried NumPy's tobytes(), but it doesn't seem to work:

with open('a.bin', 'wb') as f:
    f.write(df.values.tobytes())

The C++ reader:

#include <fstream>
#include <iostream>

using namespace std;

int main() {
  std::fstream f("a.bin", std::ios::binary | std::ios::in);
  char buff[256];
  f.read(buff, sizeof(buff));
  cout << *(int*)(buff + 2*sizeof(int)) << endl;  // should be 2, but get 0
}

How can I dump it in a binary format that C++ can read?

Upvotes: 2

Views: 743

Answers (2)

JohnE

Reputation: 30424

To elaborate on @JohnZwinck's answer and comment, see the code below. I also have a utility program on GitHub that automates most of this: dataset2binary

Here's the Python code to output a C-readable binary file. As @JohnZwinck notes, you need to change the array type to S10 or similar if you have character data. I'm also making some minor changes to your sample dataset for convenience (like making the column names strings).

import numpy as np
import pandas as pd

df = pd.DataFrame( { 'i':[1,2,3] } )
df['ticker'] = 'helloworld'
df['x'] = 1.12

names = list(df.columns)
arrays = [ df[col].values for col in names ]
# object (string) columns become fixed-width byte strings, e.g. '<U10' -> 'S10'
formats = [ array.dtype.str if array.dtype != 'O'
            else array.astype(str).dtype.str.replace('<U','S') for array in arrays ]

# one packed record per row: i, ticker, x
rec_array = np.rec.fromarrays( arrays, dtype={'names': names, 'formats': formats} )
rec_array.tofile('test.bin')

And here's some C code to read it in and check the means:

#include <stdio.h>
#include <stdint.h>

int main() {

    FILE *fp;

    /* Pack the struct so it has no padding and matches the 26-byte record. */
    #pragma pack(push,1)
    struct foobar {

      int64_t i ;        /* assumes the column was written as int64; long can be 4 bytes */
      char ticker[10] ;  /* S10: fixed width, not null-terminated */
      double x;          /* float64 */

    } foo ;
    #pragma pack(pop)

    fp = fopen( "test.bin", "rb");
    if (fp == NULL) {
        puts("Cannot open the file.");
        return 1;
    }

    int i_counter = 0 ;  /* number of records read */
    double means[2] = { 0. } ;

    while (fread(&foo, sizeof(foo), 1, fp) == 1) {
        means[0] += foo.i ;
        means[1] += foo.x ;
        i_counter ++ ;
    }
    fclose(fp);

    if (i_counter == 0) {
        puts("No records read.");
        return 1;
    }

    printf( "\nmeans of numerical columns\n\n" );
    printf( " i %f\n", means[0] / i_counter ) ;
    printf( " x %f\n", means[1] / i_counter ) ;
    printf( "\n" ) ;

    return 0;

}

One caveat is that the ticker field has no terminator at the end of the string (it is exactly 10 bytes, with no null byte or line feed after it), so you'll still need to take care of that before treating it as a C string.
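A fixed-width field such as ticker[10] can be turned into a std::string without relying on a terminator. The sketch below is one way to do it, not the only one. fixed_field_to_string is a hypothetical helper name, and it assumes values shorter than the field width are padded with trailing NUL bytes, which is how NumPy stores 'S' fields.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: copy a fixed-width, possibly unterminated character
// field into a std::string. Stops at the first NUL byte if there is one
// (NumPy pads short 'S' values with NULs), otherwise takes the full width.
std::string fixed_field_to_string(const char* field, std::size_t width) {
    const void* nul = std::memchr(field, '\0', width);
    std::size_t len = nul
        ? static_cast<std::size_t>(static_cast<const char*>(nul) - field)
        : width;
    return std::string(field, len);
}
```

With the struct from the reader above, a call would look like fixed_field_to_string(foo.ticker, sizeof(foo.ticker)).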

Upvotes: 0

John Zwinck

Reputation: 249333

There are a lot of ways to do this but the simplest, most efficient and least enterprise-grade is probably NumPy's ndarray.tofile():

df.to_numpy().tofile("/path/to/some/file")

That will write a binary file which is essentially an array of structs. You can then read it by defining the matching struct in C++ and reading into an instance of that struct repeatedly until you reach the end of the file.

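Print df.to_numpy().dtype to know what that struct should look like. If it reports object (which it will for a DataFrame with mixed types or string columns), the array holds pointers to Python objects rather than the values themselves, so convert the data to fixed-width types (for example S10 for strings) before writing.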

Upvotes: 1
