Reputation: 580
I would like to select from a HUGE hdf5 a subset of the data, day by day. It would be perfect to use a where mask, but I can't make it work with a multiindex (since I have to have a where with two conditions). can't use a where mask with a multiindex:
import itertools
import pandas as pd
import numpy as np
a = ('A', 'B')
i = (0, 1, 2)
idx = pd.MultiIndex.from_tuples(list(itertools.product(a, i)),
names=('Alpha', 'Int'))
df = pd.DataFrame(np.random.randn(len(idx), 7), index=idx,
columns=('I', 'II', 'III', 'IV', 'V', 'VI', 'VII'))
Ok, now I put it in a hdf store
from pandas.io.pytables import HDFStore
store =HDFStore('cancella.h5', 'w')
store.append('df_mask',df)
But if I read it again I have
c = store.select_column('df_mask','index')
print c
this index which is WRONG.
0 0
1 1
2 2
3 3
4 4
5 5
dtype: int64
So I can't use the where mask. Can you help me?
Upvotes: 4
Views: 486
Reputation: 1220
In current pandas version 0.24.2, you may use direct querying without intermediate MultiIndex retrieval:
idx = pd.MultiIndex.from_product([('A', 'B'), range(3)], names=('Alpha', 'Int'))
df = pd.DataFrame(np.random.randn(len(idx), 3), index=idx,
columns=('I', 'II', 'III'))
df
>>> I II III
>>> Alpha Int
>>> A 0 0.547846 -1.630704 0.456686
>>> 1 -0.057442 -0.232721 0.349548
>>> 2 1.495230 0.661166 1.309862
>>> B 0 2.116818 0.477923 -0.252352
>>> 1 1.001081 2.578723 -0.040782
>>> 2 -1.286964 0.357969 0.002976
Just use built-in where
mask:
with pd.HDFStore('test.h5') as store:
store.put('df_mask', df, format='t')
store.select('df_mask', where = 'Alpha = A & Int in [0,2]')
>>> I II III
>>> Alpha Int
>>> A 0 0.621453 -1.757883 0.494433
>>> 2 -0.689012 -1.254884 0.655424
For different functions allowed in where mask visit docs.
But if your desired filters are very complicated and you want to retrieve MultiIndex and build mask 'by hands', you may select it like so:
with pd.HDFStore('test.h5') as store:
store.select('df_mask', columns=['index'])
>>> Alpha Int
>>> A 0
>>> 1
>>> 2
>>> B 0
>>> 1
>>> 2
Upvotes: 0
Reputation: 227
This may be orthogonal to your question, but I don't see a column or an index in your DataFrame called 'index' so I don't know what you're actually selecting with
c = store.select_column('df_mask','index')
I've always used the read_hdf
wrapper to deal with PyTables and have found it very flexible. The syntax for it is exceedingly flexible, e.g.
c = pd.read_hdf('/home/Jian/Downloads/temp.h5', 'df_mask', where="Alpha='A' | Alpha='B' & Int=0")
There's a firm limit to the number of conditions in your where clause, but if I've read your question correctly, that'd be a second order issue for you.
Upvotes: 0
Reputation: 109536
Have you tried using Pandas to_hdf
(after installing pytables)?
df.to_hdf('/Users/Alexander/Downloads/hdf.h5', key='data', format='t')
>>> pd.read_hdf('/Users/Alexander/Downloads/hdf.h5', 'data', columns='index')
Empty DataFrame
Columns: []
Index: [(A, 0), (A, 1), (A, 2), (B, 0), (B, 1), (B, 2)]
Choosing the format='t' option:
"Write as a PyTables Table structure which may perform worse but allow more flexible operations like searching / selecting subsets of the data"
This documentation provides details of selecting data with a where
mask.
Upvotes: 0
Reputation: 24742
I guess this is because pandas might reset_index
before putting table into HDF5
(reason might be that it tries to avoid any potential duplicated index, which is allowed in dataframe but very bad for database), and use the integer auto-incremented primary key as the index in HDF5
table. So the code you wrote in the post selects these auto-incremented primary key.
There might be some other more elegant approaches, but I found the following work. (a consequence if pandas indeed reset_index
, so now that multi-level index becomes columns...)
import itertools
import pandas as pd
import numpy as np
a = ('A', 'B')
i = (0, 1, 2)
idx = pd.MultiIndex.from_tuples(list(itertools.product(a, i)),
names=('Alpha', 'Int'))
df = pd.DataFrame(np.random.randn(len(idx), 7), index=idx,
columns=('I', 'II', 'III', 'IV', 'V', 'VI', 'VII'))
print(df)
store = pd.HDFStore('/home/Jian/Downloads/temp.h5')
store.append('df_mask',df)
store.select('df_mask', columns=['Alpha','Int'])
Upvotes: 0