Reputation: 10482
I have some binary data and I was wondering how I can load it into pandas.
Can I somehow load it, specifying the format it is in and what the individual columns are called?
Edit:
Format is
int, int, int, float, int, int[256]
Each comma-separated entry represents a column in the data, i.e. the last 256 integers are one column.
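For anyone who wants to test against sample data: a file in this layout can be produced with Python's struct module. A minimal sketch (the filename and values here are made up):
import struct
# Write one made-up record: int, int, int, float, int, int[256]
record = struct.pack('iiifi256i', 1, 2, 3, 4.0, 5, *range(256))
with open('data.bin', 'wb') as f:  # placeholder filename
    f.write(record)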
Upvotes: 27
Views: 63492
Reputation: 681
Even though this is an old question, I was wondering the same thing and I didn't see a solution I liked. When reading binary data with Python, I have found numpy.fromfile or numpy.fromstring to be much faster than using the Python struct module. Binary data with mixed types can be efficiently read into a numpy array using the methods above, as long as the data format is constant and can be described with a numpy data type object (numpy.dtype).
import numpy as np
import pandas as pd

# Create a dtype with the binary data format and the desired column names
dt = np.dtype([('a', 'i4'), ('b', 'i4'), ('c', 'i4'), ('d', 'f4'), ('e', 'i4'),
               ('f', 'i4', (256,))])
data = np.fromfile(file, dtype=dt)  # 'file' is the path to your binary file
df = pd.DataFrame(data)
# Or if you want to explicitly set the column names
df = pd.DataFrame(data, columns=data.dtype.names)
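Note that with this dtype, 'f' comes back as a single column whose values are length-256 arrays. If you would rather have 256 scalar columns, one possible way (just a sketch; the f0..f255 names are my own choice) is to expand that field from the structured array:
# Continuing from above: split the array-valued 'f' field into scalar columns
scalars = pd.DataFrame({name: data[name] for name in ('a', 'b', 'c', 'd', 'e')})
f_cols = pd.DataFrame(data['f'], columns=['f%d' % i for i in range(256)])
df = pd.concat([scalars, f_cols], axis=1)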
Edits: removed an unnecessary data.to_list() conversion (thanks fxx) and clarified the columns argument.
Upvotes: 48
Reputation: 362
Recently I was confronted with a similar problem, though with a much bigger structure. I think I found an improvement on mowen's answer using the utility method DataFrame.from_records. In the example above, this would give:
import numpy as np
import pandas as pd
# Create a dtype with the binary data format and the desired column names
dt = np.dtype([('a', 'i4'), ('b', 'i4'), ('c', 'i4'), ('d', 'f4'), ('e', 'i4'), ('f', 'i4', (256,))])
data = np.fromfile(file, dtype=dt)
df = pd.DataFrame.from_records(data)
In my case, it significantly sped up the process. I assume the improvement comes from not having to create an intermediate Python list, but rather creating the DataFrame directly from the NumPy structured array.
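If you want to check the difference on your own machine, here is a rough timing sketch (synthetic data with a flat dtype, just for the comparison; numbers will vary):
import timeit
import numpy as np
import pandas as pd

# Synthetic structured array standing in for real file contents
dt = np.dtype([('a', 'i4'), ('b', 'i4'), ('c', 'i4'), ('d', 'f4'), ('e', 'i4')])
sample = np.zeros(100000, dtype=dt)
print(timeit.timeit(lambda: pd.DataFrame(sample), number=10))
print(timeit.timeit(lambda: pd.DataFrame.from_records(sample), number=10))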
Upvotes: 16
Reputation: 196
The following uses a compiled struct (struct.Struct), which is a lot faster than calling the module-level struct functions repeatedly. An alternative is to use np.fromstring or np.fromfile, as mentioned above (note that np.fromstring is deprecated in newer NumPy in favour of np.frombuffer).
import struct, ctypes, os
import numpy as np, pandas as pd

# Pre-compiled struct for one record: int, int, int, float, int, int[256]
mystruct = struct.Struct('iiifi256i')
buff = ctypes.create_string_buffer(mystruct.size)
# Matching numpy dtype; the trailing 256 ints form one sub-array field
dtype = np.dtype([('a', 'i4'), ('b', 'i4'), ('c', 'i4'), ('d', 'f4'), ('e', 'i4'),
                  ('f', 'i4', (256,))])
with open(input_filename, mode='rb') as f:
    nrows = os.fstat(f.fileno()).st_size // mystruct.size
    array = np.empty((nrows,), dtype=dtype)
    for row in range(nrows):
        buff.raw = f.read(mystruct.size)
        record = mystruct.unpack_from(buff, 0)
        # (np.frombuffer(buff.raw, dtype=dtype) is an alternative to unpack_from)
        # Regroup the flat 261-tuple so the last 256 values fill the sub-array field
        array[row] = record[:5] + (record[5:],)
df = pd.DataFrame(array)
see also http://pymotw.com/2/struct/
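To see how much the pre-compiled struct actually buys you, a quick measurement sketch (the module-level functions cache compiled formats internally, so the gap is mostly the repeated format lookup):
import struct
import timeit

fmt = 'iiifi256i'
payload = struct.pack(fmt, 1, 2, 3, 4.0, 5, *range(256))
compiled = struct.Struct(fmt)
print(timeit.timeit(lambda: struct.unpack(fmt, payload), number=100000))
print(timeit.timeit(lambda: compiled.unpack(payload), number=100000))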
Upvotes: 1
Reputation: 14619
Here's something to get you started.
import os
from struct import unpack, calcsize
from pandas import DataFrame

entry_format = 'iiifi256i'  # int, int, int, float, int, int[256]
field_names = ['a', 'b', 'c', 'd', 'e', 'f']
entry_size = calcsize(entry_format)
rows = []
with open(input_filename, mode='rb') as f:
    entry_count = os.fstat(f.fileno()).st_size // entry_size
    for i in range(entry_count):
        record = f.read(entry_size)
        entry = unpack(entry_format, record)
        # unpack returns a flat 261-tuple; keep the last 256 ints together as 'f'
        rows.append(dict(zip(field_names, entry[:5] + (entry[5:],))))
df = DataFrame(rows, columns=field_names)
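On Python 3.4+, struct.iter_unpack can replace the manual read loop when the whole file fits in memory; a sketch reusing the names defined above:
from struct import iter_unpack

with open(input_filename, mode='rb') as f:
    raw = f.read()
# Each entry is a flat 261-tuple; regroup the trailing 256 ints as above
rows = [dict(zip(field_names, e[:5] + (e[5:],)))
        for e in iter_unpack(entry_format, raw)]
df = DataFrame(rows, columns=field_names)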
Upvotes: 1