Reputation: 872
I have a pandas DataFrame whose columns have different data types.
I want to use it from C++, and for performance reasons I want the C++ side to read it in a binary format.
For example:
In [4]: df = pd.DataFrame(np.reshape(range(9), (3, 3)))
In [5]: df
Out[5]:
   0  1  2
0  0  1  2
1  3  4  5
2  6  7  8
In [6]: df['ticker'] = 'helloworld'
In [7]: df['float'] = 1.12
In [8]: df
Out[8]:
   0  1  2      ticker  float
0  0  1  2  helloworld   1.12
1  3  4  5  helloworld   1.12
2  6  7  8  helloworld   1.12
I tried NumPy's tobytes(), but it does not seem to work:
with open('a.bin', 'wb') as f:
    f.write(df.values.tobytes())
The C++ code that reads it:
#include <fstream>
#include <iostream>
using namespace std;
int main() {
    std::fstream f("a.bin", std::ios::binary | std::ios::in);
    char buff[256];
    f.read(buff, sizeof(buff));
    cout << *(int*)(buff + 2*sizeof(int)) << endl; // should be 2, but get 0
}
How can I dump the DataFrame in a binary format that C++ can read?
Upvotes: 2
Views: 743
Reputation: 30424
To elaborate on @JohnZwinck's answer and comment, see the code below. I also have a utility program on GitHub that automates most of this: dataset2binary
Here's the Python code to write a C-readable binary file. As @JohnZwinck notes, you need to change the array type to S10 or similar if you have character data. I'm also making some minor changes to your sample dataset for convenience (like making the column names strings).
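import numpy as np
import pandas as pd

df = pd.DataFrame( { 'i':[1,2,3] } )
df['ticker'] = 'helloworld'
df['x'] = 1.12
names = list(df.columns)
arrays = [ df[col].values for col in names ]
# Keep numeric dtypes as they are; turn object (string) columns into fixed-width bytes, e.g. 'S10'
formats = [ array.dtype.str if array.dtype != 'O'
            else array.astype(str).dtype.str.replace('<U','S') for array in arrays ]
rec_array = np.rec.fromarrays( arrays, dtype={'names': names, 'formats': formats} )
# Each row is written as one packed record: i, ticker, x
rec_array.tofile('test.bin')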
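As a rough cross-check before writing the C side, the record layout can be read back in Python with the standard struct module. This is only a sketch: it rebuilds the three sample rows by hand instead of going through the DataFrame, writes to a temporary path (not the test.bin above), and assumes little-endian '<i8', 'S10' and '<f8' fields. Check rec_array.dtype on your own machine before relying on those.

```python
import os
import struct
import tempfile

import numpy as np

# Assumed layout of one record: int64, 10-byte string, float64 (packed, no padding).
dtype = np.dtype([("i", "<i8"), ("ticker", "S10"), ("x", "<f8")])
records = np.array(
    [(1, b"helloworld", 1.12), (2, b"helloworld", 1.12), (3, b"helloworld", 1.12)],
    dtype=dtype,
)

# Hypothetical scratch location, used only for this check.
path = os.path.join(tempfile.mkdtemp(), "test.bin")
records.tofile(path)

# "<" means little-endian with no alignment padding, like #pragma pack(1) in C:
# q = int64, 10s = char[10], d = double.
layout = struct.Struct("<q10sd")
assert layout.size == dtype.itemsize  # both are 26 bytes per record

with open(path, "rb") as f:
    raw = f.read()

# Decode the file one fixed-size record at a time.
rows = [layout.unpack_from(raw, offset) for offset in range(0, len(raw), layout.size)]
print(rows[0])
```

If the struct size and the dtype's itemsize disagree, the C struct will not line up with the file either.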
And here's some C code to read it in and check the means:
#include <stdio.h>
#include <stdint.h>
int main() {
    FILE *fp;
    /* Pack the struct so it has no padding and matches the NumPy record byte for byte. */
    #pragma pack(push,1)
    struct foobar {
        int64_t i ;          /* NumPy '<i8'; plain long is only 4 bytes on some platforms */
        char ticker[10] ;    /* NumPy 'S10': fixed width, not null-terminated */
        double x;            /* NumPy '<f8' */
    } foo ;
    #pragma pack(pop)
    fp = fopen( "test.bin", "rb");
    if (fp == NULL) {
        puts("Cannot open the file.");
        return 1;
    }
    int n_rows = 0 ;
    double sums[2] = { 0. } ;
    /* Read one record per iteration until the end of the file. */
    while (fread(&foo, sizeof(foo), 1, fp) == 1) {
        sums[0] += foo.i ;
        sums[1] += foo.x ;
        n_rows ++ ;
    }
    fclose(fp);
    if (n_rows == 0) {
        puts("No records read.");
        return 1;
    }
    printf( "\nmeans of numerical columns\n\n" );
    printf( "  i  %f\n", sums[0] / n_rows ) ;
    printf( "  x  %f\n", sums[1] / n_rows ) ;
    printf( "\n" ) ;
    return 0;
}
One caveat is that the string column is written as a fixed-width block of bytes with nothing after it (no line feed or terminator), so you'll still need to take care of that yourself when you use it as a C string.
Upvotes: 0
Reputation: 249333
There are a lot of ways to do this, but the simplest, most efficient and least enterprise-grade is probably NumPy's ndarray.tofile():
df.to_numpy().tofile("/path/to/some/file")
That will write a binary file which is essentially an array of structs. You can then read it by defining the matching struct in C++ and reading into an instance of that struct repeatedly until you reach the end of the file.
Print df.to_numpy().dtype to know what that struct should look like.
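For a frame with mixed column types like the one in the question, to_numpy() returns a single object-dtype array, which has no fixed per-row byte layout. One way to get a real array of structs is DataFrame.to_records() with explicit column dtypes. The sketch below assumes string-named columns (i, ticker, x, which are not the question's original names) and that every ticker fits in 10 bytes.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"i": [1, 2, 3]})
df["ticker"] = "helloworld"
df["x"] = 1.12

# Mixed int/str/float columns collapse to one generic dtype.
print(df.to_numpy().dtype)  # object

# Assumption: tickers fit in 10 bytes; longer values would be truncated by 'S10'.
rec = df.to_records(index=False, column_dtypes={"i": "<i8", "ticker": "S10"})

# The structured dtype is what the C++ struct has to mirror.
print(rec.dtype)
print(rec.dtype.itemsize)  # bytes per row

# rec.tofile("/path/to/some/file") would then write one packed record per row.
```

The printed dtype lists each field's name and type in order, and itemsize gives the number of bytes the C++ reader should consume per row.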
Upvotes: 1