Reputation: 6703
I have a data frame with user_ids stored as an indexed frame_table in an HDFStore. Also in this HDF file is another table with actions the user took. I want to grab all of the actions taken by 1% of the users. The procedure is as follows:
# Get 1% of the user IDs (assumes: import random as rnd)
df_id = store.select('df_user_id', columns=['id'])
unique_ids = df_id.id.unique()
users_1pct = rnd.sample(list(unique_ids), int(0.01 * len(unique_ids)))
df_id = df_id[df_id.id.isin(users_1pct)]
Now I want to go back and get all of the additional info that describes the actions taken by these users from frame_tables that are indexed identically to df_user_id. As per this example and this question I have done the following:
actions_1pct = store.select('df_actions', where=pd.Term('index', df_id.index))
This simply returns an empty data frame. In fact, if I copy and paste the example from the pandas doc link above, I also get an empty data frame. Did something change about Term in recent pandas? I'm on pandas 0.12.
I'm not tied to any particular solution, as long as I can get the HDFStore indices from a lookup on the df_id table (which is fast) and then pull those indices directly from the other frame tables.
Upvotes: 1
Views: 2121
Reputation: 128948
Here is the way to do it in 0.12. In 0.13, where can be an indexer (e.g. an array of locations), so this is much easier; see the second example under [Selecting using a where mask](http://pandas.pydata.org/pandas-docs/dev/io.html#advanced-queries).
In [2]: df = DataFrame(dict(A=list(range(5)),B=list(range(5))))
In [3]: df
Out[3]:
A B
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
In [4]: store = pd.HDFStore('test.h5',mode='w')
In [5]: store.append('df',df)
Select and return a Coordinates object (just a wrapped array of row locations) for the rows that match some where condition
In [6]: c = store.select_as_coordinates('df', ['index<3'])
where accepts a Coordinates object, and you can use it with any table (in your case that would be your 'df_actions' table)
In [7]: store.select('df', where=c)
Out[7]:
A B
0 0 0
1 1 1
2 2 2
In [8]: c
Out[8]: <pandas.io.pytables.Coordinates at 0x4669590>
In [9]: c.values
Out[9]: array([0, 1, 2])
If you want to manipulate this, just assign the positions you want to the Coordinates object before passing it to select. (As I said above, this 'hack' is going away in 0.13, where you don't need the intermediate object.)
In [10]: c.values = np.array([0,1])
In [11]: store.select('df', where=c)
Out[11]:
A B
0 0 0
1 1 1
store.close()
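For 0.13 and later, a minimal sketch of the same idea applied to the question's setup might look like the following. The table contents, the ids being picked, and the temporary file path are made up for illustration. It assumes both tables were appended in the same row order, because an integer array passed as where is treated as row positions rather than index labels. It also needs PyTables installed.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical stand-ins for the question's 'df_user_id' and 'df_actions'
# tables. Both use the same default integer index and the same row order.
users = pd.DataFrame({"id": [10, 11, 12, 13, 14]})
actions = pd.DataFrame({"action": ["a", "b", "c", "d", "e"]})

path = os.path.join(tempfile.mkdtemp(), "test.h5")
with pd.HDFStore(path, mode="w") as store:
    # append() writes in table format, which is what makes where= usable.
    store.append("df_user_id", users)
    store.append("df_actions", actions)

    # Look up the wanted rows in the small id table ...
    ids = store.select("df_user_id", columns=["id"])
    wanted = ids["id"].isin([11, 13])  # illustrative subset of users
    positions = np.flatnonzero(wanted.to_numpy())

    # ... then pull the same row positions from the other table.
    selected_actions = store.select("df_actions", where=positions)

print(selected_actions)
```

The positions array plays the role of c.values in the 0.12 session above, just without the intermediate Coordinates object.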
Upvotes: 3