Reputation: 339
I want to read only specific columns from an HDF5 file and apply conditions to those columns. I don't want to load the whole HDF5 file into memory as a DataFrame; I want to get only the columns I need, filtered by my conditions.
import os
import pandas as pd

columns = ['col1', 'col2']
condition = "col2 == 1"
groupname = '/path/to/group'
Hdf5File = os.path.join('path', 'to', 'hdf5.h5')
with pd.HDFStore(Hdf5File, mode='r') as store:
    if groupname in store:
        df = pd.read_hdf(store, key=groupname, columns=columns, where=condition)
I get this error:
TypeError: cannot pass a column specification when reading a Fixed format store. this store must be selected in its entirety
Then I tried the line below, which returns only the specified columns:
df=store[groupname][columns]
But I don't know how to pass a condition to it.
Upvotes: 8
Views: 7501
Reputation: 210912
To read an HDF5 file conditionally, it must be saved in table format, and the columns used
in the condition must be declared as data columns (data_columns=...) so that they are indexed and queryable.
Demo:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(100, 5), columns=list('abcde'))
df.to_hdf('c:/temp/file.h5', key='df_key', format='table', data_columns=True)
In [10]: pd.read_hdf('c:/temp/file.h5', 'df_key', where="a > 0.5 and a < 0.75")
Out[10]:
           a         b         c         d         e
3   0.744123  0.515697  0.005335  0.017147  0.176254
5   0.555202  0.074128  0.874943  0.660555  0.776340
6   0.667145  0.278355  0.661728  0.705750  0.623682
8   0.701163  0.429860  0.223079  0.735633  0.476182
14  0.645130  0.302878  0.428298  0.969632  0.983690
15  0.633334  0.898632  0.881866  0.228983  0.216519
16  0.535633  0.906661  0.221823  0.608291  0.330101
17  0.715708  0.478515  0.002676  0.231314  0.075967
18  0.587762  0.262281  0.458854  0.811845  0.921100
21  0.551251  0.537855  0.906546  0.169346  0.063612
..       ...       ...       ...       ...       ...
68  0.610958  0.874373  0.785681  0.147954  0.966443
72  0.619666  0.818202  0.378740  0.416452  0.903129
73  0.500782  0.536064  0.697678  0.654602  0.054445
74  0.638659  0.518900  0.210444  0.308874  0.604929
76  0.696883  0.601130  0.402640  0.150834  0.264218
77  0.692149  0.963457  0.364050  0.152215  0.622544
85  0.737854  0.055863  0.346940  0.003907  0.678405
91  0.644924  0.840488  0.151190  0.566749  0.181861
93  0.710590  0.900474  0.061603  0.144200  0.946062
95  0.601144  0.288909  0.074561  0.615098  0.737097

[33 rows x 5 columns]
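The question also asks for a subset of columns. Once the data is stored in table format, `columns=` can be combined with `where=` in the same `read_hdf` call. Below is a minimal sketch, assuming PyTables is installed; the temporary file path and the key name `df_key` are illustrative only and not taken from the question:

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Illustrative data; the temporary path and key name are assumptions.
df = pd.DataFrame(np.random.rand(100, 5), columns=list("abcde"))
path = os.path.join(tempfile.mkdtemp(), "file.h5")

# format="table" makes the store queryable; data_columns=True allows
# every column to be used in a where= expression.
df.to_hdf(path, key="df_key", format="table", data_columns=True)

# Only columns a and b are returned, and only the rows matching the condition.
subset = pd.read_hdf(path, key="df_key", columns=["a", "b"],
                     where="a > 0.5 and a < 0.75")
print(subset.shape)
```

Declaring every column as a data column makes the file larger and slower to write, so for wide tables it may be preferable to pass only the columns you filter on, for example `data_columns=["a"]`.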
UPDATE:
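If you can't change the HDF5 file, consider reading it in chunks and filtering each chunk, so that only one chunk plus the matching rows are held in memory at a time. Note that chunksize works only for stores saved in table format; a fixed-format store still has to be read in its entirety: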
In [13]: df = pd.concat([x.query("0.5 < a < 0.75")
                         for x in pd.read_hdf('c:/temp/file.h5', 'df_key', chunksize=10)],
                        ignore_index=True)
In [14]: df
Out[14]:
           a         b         c         d         e
0   0.744123  0.515697  0.005335  0.017147  0.176254
1   0.555202  0.074128  0.874943  0.660555  0.776340
2   0.667145  0.278355  0.661728  0.705750  0.623682
3   0.701163  0.429860  0.223079  0.735633  0.476182
4   0.645130  0.302878  0.428298  0.969632  0.983690
5   0.633334  0.898632  0.881866  0.228983  0.216519
6   0.535633  0.906661  0.221823  0.608291  0.330101
7   0.715708  0.478515  0.002676  0.231314  0.075967
8   0.587762  0.262281  0.458854  0.811845  0.921100
9   0.551251  0.537855  0.906546  0.169346  0.063612
..       ...       ...       ...       ...       ...
23  0.610958  0.874373  0.785681  0.147954  0.966443
24  0.619666  0.818202  0.378740  0.416452  0.903129
25  0.500782  0.536064  0.697678  0.654602  0.054445
26  0.638659  0.518900  0.210444  0.308874  0.604929
27  0.696883  0.601130  0.402640  0.150834  0.264218
28  0.692149  0.963457  0.364050  0.152215  0.622544
29  0.737854  0.055863  0.346940  0.003907  0.678405
30  0.644924  0.840488  0.151190  0.566749  0.181861
31  0.710590  0.900474  0.061603  0.144200  0.946062
32  0.601144  0.288909  0.074561  0.615098  0.737097

[33 rows x 5 columns]
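The chunked approach can be wrapped in a small helper that also drops unneeded columns from each chunk before concatenating. This is only a sketch: the function name `filter_hdf_in_chunks`, the temporary file, and the key `df_key` are hypothetical, and it assumes a table-format store, since chunked reads are not available for fixed-format stores:

```python
import os
import tempfile

import numpy as np
import pandas as pd


def filter_hdf_in_chunks(path, key, query, columns, chunksize=10_000):
    """Hypothetical helper: filter a table-format HDF5 key chunk by chunk."""
    parts = []
    with pd.HDFStore(path, mode="r") as store:
        # select(..., chunksize=...) yields DataFrames of at most chunksize rows.
        for chunk in store.select(key, chunksize=chunksize):
            # Filter the chunk in memory, then keep only the requested columns.
            parts.append(chunk.query(query)[columns])
    if not parts:
        return pd.DataFrame(columns=columns)
    return pd.concat(parts, ignore_index=True)


# Illustrative file written WITHOUT data_columns, so where= cannot filter on a.
df = pd.DataFrame(np.random.rand(100, 5), columns=list("abcde"))
path = os.path.join(tempfile.mkdtemp(), "file.h5")
df.to_hdf(path, key="df_key", format="table")

result = filter_hdf_in_chunks(path, "df_key", "0.5 < a < 0.75", ["a", "b"],
                              chunksize=10)
print(result.shape)
```

Every row is still read from disk once, so this trades speed for a bounded memory footprint; a larger `chunksize` reduces per-chunk overhead at the cost of more memory per chunk.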
Upvotes: 10