Safariba
Safariba

Reputation: 339

read specific columns from hdf5 file and pass conditions

I want to read only specific columns from HDF5 file and pass conditions on those columns. My concern is that I dont want to fetch all HDF5 file as dataframe in the memory. I want to get only my necessary columns with their conditions.

columns=['col1', 'col2']
condition= "col2==1"
groupname='\path\to\group'
Hdf5File=os.path.join('path\to\hdf5.h5')
with pd.HDFStore(Hdf5File, mode='r', format='table') as store:
     if groupname in store:
        df=pd.read_hdf(store, key=groupname, columns=columns, where=["col2==1"])

I get an error :

TypeError: cannot pass a column specification when reading a Fixed format store. this store must be selected in its entirety

Then I use below line which returns only specific columns:

df=store[groupname][columns]

But I dont know how can I pass condition on it.

Upvotes: 8

Views: 7501

Answers (1)

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210912

In order to be able to read HDF5 files conditionally, they must be saved in the table format and the corresponding columns must be indexed.

Demo:

df = pd.DataFrame(np.random.rand(100,5), columns=list('abcde'))
df.to_hdf('c:/temp/file.h5', 'df_key', format='t', data_columns=True)

In [10]: pd.read_hdf('c:/temp/file.h5', 'df_key', where="a > 0.5 and a < 0.75")
Out[10]:
           a         b         c         d         e
3   0.744123  0.515697  0.005335  0.017147  0.176254
5   0.555202  0.074128  0.874943  0.660555  0.776340
6   0.667145  0.278355  0.661728  0.705750  0.623682
8   0.701163  0.429860  0.223079  0.735633  0.476182
14  0.645130  0.302878  0.428298  0.969632  0.983690
15  0.633334  0.898632  0.881866  0.228983  0.216519
16  0.535633  0.906661  0.221823  0.608291  0.330101
17  0.715708  0.478515  0.002676  0.231314  0.075967
18  0.587762  0.262281  0.458854  0.811845  0.921100
21  0.551251  0.537855  0.906546  0.169346  0.063612
..       ...       ...       ...       ...       ...
68  0.610958  0.874373  0.785681  0.147954  0.966443
72  0.619666  0.818202  0.378740  0.416452  0.903129
73  0.500782  0.536064  0.697678  0.654602  0.054445
74  0.638659  0.518900  0.210444  0.308874  0.604929
76  0.696883  0.601130  0.402640  0.150834  0.264218
77  0.692149  0.963457  0.364050  0.152215  0.622544
85  0.737854  0.055863  0.346940  0.003907  0.678405
91  0.644924  0.840488  0.151190  0.566749  0.181861
93  0.710590  0.900474  0.061603  0.144200  0.946062
95  0.601144  0.288909  0.074561  0.615098  0.737097

[33 rows x 5 columns]

UPDATE:

If you can't change the HDF5 file, then consider the following technique:

In [13]: df = pd.concat([x.query("0.5 < a < 0.75")
                         for x in pd.read_hdf('c:/temp/file.h5', 'df_key', chunksize=10)],
                        ignore_index=True)

In [14]: df
Out[14]:
           a         b         c         d         e
0   0.744123  0.515697  0.005335  0.017147  0.176254
1   0.555202  0.074128  0.874943  0.660555  0.776340
2   0.667145  0.278355  0.661728  0.705750  0.623682
3   0.701163  0.429860  0.223079  0.735633  0.476182
4   0.645130  0.302878  0.428298  0.969632  0.983690
5   0.633334  0.898632  0.881866  0.228983  0.216519
6   0.535633  0.906661  0.221823  0.608291  0.330101
7   0.715708  0.478515  0.002676  0.231314  0.075967
8   0.587762  0.262281  0.458854  0.811845  0.921100
9   0.551251  0.537855  0.906546  0.169346  0.063612
..       ...       ...       ...       ...       ...
23  0.610958  0.874373  0.785681  0.147954  0.966443
24  0.619666  0.818202  0.378740  0.416452  0.903129
25  0.500782  0.536064  0.697678  0.654602  0.054445
26  0.638659  0.518900  0.210444  0.308874  0.604929
27  0.696883  0.601130  0.402640  0.150834  0.264218
28  0.692149  0.963457  0.364050  0.152215  0.622544
29  0.737854  0.055863  0.346940  0.003907  0.678405
30  0.644924  0.840488  0.151190  0.566749  0.181861
31  0.710590  0.900474  0.061603  0.144200  0.946062
32  0.601144  0.288909  0.074561  0.615098  0.737097

[33 rows x 5 columns]

Upvotes: 10

Related Questions