Reputation: 472
Currently, the data in the h5 file does not have the prefix 'b', but when I read it with the following code the string columns come back as bytes (with the 'b' prefix). I wonder whether there is a better way to read the h5 file so that the strings come out without the 'b' prefix.
import tables as tb
import pandas as pd
import numpy as np
import time

time0 = time.time()
pth = 'd:/download/'

# read data
data_trading = pth + 'Trading_v01.h5'
filem = tb.open_file(data_trading, mode='a', driver="H5FD_CORE")
tb_trading = filem.get_node(where='/', name='wind_data')
df = pd.DataFrame.from_records(tb_trading[:])
filem.close()  # done with the file once the records are in memory
time1 = time.time()
print('\ntime on reading data %6.3fs' % (time1 - time0))

# in Python 3, remove prefix 'b' by decoding each value
df.loc[:, 'Date'] = [dt.decode('utf-8') for dt in df.loc[:, 'Date']]
df.loc[:, 'Code'] = [cd.decode('utf-8') for cd in df.loc[:, 'Code']]
time2 = time.time()
print("\ntime on removing prefix 'b' %6.3fs" % (time2 - time1))
print('\ntotal time %6.3fs' % (time2 - time0))
The timing results:
time on reading data 1.569s
time on removing prefix 'b' 29.921s
total time 31.490s
As you can see, removing the 'b' prefix is really time-consuming.
I have tried using pd.read_hdf, which does not produce the 'b' prefix:
%time df2=pd.read_hdf(data_trading)
Wall time: 14.7 s
which is faster so far.
Using this SO answer and a vectorised str.decode, I can cut the conversion time to 9.1 seconds (and thus the total time to less than 11 seconds):
for key in ['Date', 'Code']:
    df[key] = df[key].str.decode("utf-8")
Question: is there an even more effective way to convert my bytes columns to strings when reading an HDF5 data table?
Upvotes: 4
Views: 3067
Reputation: 249093
The best solution for performance is to stop trying to "remove the b prefix." The b prefix is there because your data consists of bytes, and Python 3 insists on displaying this prefix to indicate bytes in many places, even places where it makes no sense, such as the output of the built-in csv module.
But inside your own program this may not hurt anything, and in fact, if you want the highest performance, you may be better off leaving these columns as bytes. This is especially true if you're using Python 3.0 to 3.2, which always use a multi-byte unicode representation (see PEP 393, which changed this in 3.3).
Even if you are using Python 3.3 or later, where the conversion from bytes to unicode doesn't cost you any extra space, it may still be a waste of time if you have a lot of data.
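In practice this means you can keep working with the bytes columns directly. A minimal sketch (the column names and values are placeholders modelled on the question's data):
import pandas as pd

# Hypothetical bytes columns, as produced by reading the HDF5 table.
df = pd.DataFrame({'Code': [b'000001.SZ', b'000002.SZ'],
                   'Date': [b'2017-01-03', b'2017-01-04']})

# Comparison and filtering work on bytes without decoding anything.
mask = df['Code'] == b'000001.SZ'   # compare against a bytes literal
print(df.loc[mask])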
Finally, Pandas is not optimal if you are dealing with columns of mostly unique strings which have a somewhat consistent width. For example, if you have columns of text data which are license plate numbers, all of them will fit in about 9 characters. The inefficiency arises because Pandas does not exactly have a string column type, but instead uses an object column type, which contains pointers to strings stored separately. This is bad for CPU caches, bad for memory bandwidth, and bad for memory consumption (again, if your strings are mostly unique and of similar lengths). If your strings have highly variable widths, it may be worth it, because a short string takes only its own length plus a pointer, whereas the fixed-width storage typical in NumPy and HDF5 takes the full column width for every string (even empty ones).
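To see the difference, here is a rough sketch comparing a fixed-width NumPy bytes column with the equivalent Pandas object column (the 9-character codes are made up for illustration):
import numpy as np
import pandas as pd

# One million 9-character codes, stored two ways.
codes_np = np.array([b'000001.SZ'] * 1_000_000, dtype='S9')  # fixed-width bytes
codes_pd = pd.Series(codes_np).str.decode('utf-8')           # object column

print(codes_np.nbytes)                    # 9,000,000 bytes: 9 per row, contiguous
print(codes_pd.memory_usage(deep=True))   # much larger: a pointer plus a
                                          # separate string object per row
On a typical CPython build the object column costs several times more memory than the fixed-width array, on top of the pointer-chasing cost.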
To get fast, fixed-width string columns in Python, you may consider using NumPy, which you can read via the excellent h5py library. This will give you a NumPy array which is a lot more similar to the underlying data stored in HDF5. It may still have the b prefix, because Python insists that non-unicode strings always display this prefix, but that's not necessarily something you should try to prevent.
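A minimal sketch of that approach, assuming the file and node names from the question (a PyTables table is an ordinary HDF5 compound dataset, so h5py can read it):
import h5py

with h5py.File('d:/download/Trading_v01.h5', 'r') as f:
    arr = f['/wind_data'][:]  # structured NumPy array, fixed-width bytes fields

# Fields such as arr['Date'] keep the compact 'S...' dtype described above.
print(arr.dtype)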
Upvotes: 1