Reputation: 71
I'm trying to understand the ideal way to organise data within Pandas to achieve the best aggregation performance. The data I am dealing with arrives as monthly files of the form yyyy-mm.csv, which I just read_csv in and then to_hdf out (a rough sketch of this step follows the sample below). It generally looks a bit like this:
ObjectID  Timestamp            ParamA  ParamB  -->  ParamZ
1         2013-01-01 00:00:00  1       9
2         2013-01-01 00:00:00  3       2
1         2013-01-01 00:10:00  8       11
2         2013-01-01 00:10:00  6       14
There are about 50 object IDs, with a reading for each one every 10 minutes for the whole month. The end result I want to achieve is aggregated data (e.g. the mean) for a single parameter, grouped by month (or potentially a finer resolution eventually), over, say, 5 years.
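Roughly, the ingest step is just this (file and key names here are illustrative):

import pandas as pd

# read one month of readings and append it to the store
df = pd.read_csv('2013-01.csv', parse_dates=['Timestamp'])
df.to_hdf('readings.h5', 'readings', mode='a', append=True,
          format='table')  # table=True in 0.12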
What I've discovered so far is that an HDFStore.select of a single column isn't really a great deal quicker than bringing all of the params into a single DataFrame at once. It therefore feels very wasteful, and the performance is not great. Without knowing exactly why this is, I can't really decide on the best way to move forward.

It seems that if the data were transposed such that the yyyy-mm ran along the x axis, with the dd hh:mm:ss down the y axis, and there were one of these DataFrames per parameter, then performance would massively improve, as more data could be brought in in one hit. The groupby operations are really quick once things have been read in from disk. However, I'm not at all convinced that this is how it is supposed to be used. Can anyone advise the best way to organise and store the data?
Thanks
Upvotes: 1
Views: 1726
Reputation: 128958
Please review the HDFStore docs here, and the cookbook recipes here.
PyTables stores data in a row-oriented format, so it generally behooves you to have long and not especially wide tables. However, if you tend to query and need/want the entire row, then the width does not present a problem.
On the other hand, if you are generally after a small subset of columns, you will want to shard the table into multiple tables (possibly with the same indexing scheme), so you can use a 'master' table to run the query, then select 'columns' (other tables) as needed. You can accomplish this via the append_to_multiple/select_as_multiple methods, for example. Taken to the extreme, you could store each column in a separate group and make yourself a column-oriented table. However, this will substantially slow things down if, say, you tend to select a lot of columns.
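As a rough sketch of that sharded layout (the file and table names are illustrative assumptions):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(8, 4), columns=list('ABCD'))

store = pd.HDFStore('sharded.h5', mode='w')

# put the queryable column in a small selector table and everything else
# in a second table; both tables share the same row ordering
store.append_to_multiple(
    {'selector': ['A'], 'rest': None},  # None -> all remaining columns
    df, selector='selector', data_columns=['A'])

# run the query against the selector table, then pull the matching rows
# from both tables and glue them back together
result = store.select_as_multiple(['selector', 'rest'],
                                  where='A>0', selector='selector')
store.close()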
Furthermore, you always want to have the queryable columns as indexes or data_columns, as these allow queries in the first place and are indexed.
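For instance, a minimal sketch (names illustrative): declaring a column as a data column at write time is what makes it usable in a where clause later:

import numpy as np
import pandas as pd

df = pd.DataFrame({'ObjectID': [1, 2, 1, 2],
                   'ParamA': np.random.randn(4)})

# data_columns=['ObjectID'] stores ObjectID as its own indexed column,
# so it can be queried on disk
df.to_hdf('params.h5', 'df', mode='w', format='table',
          data_columns=['ObjectID'])

pd.read_hdf('params.h5', 'df', where='ObjectID=1')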
So it comes down to the ratio of queries that select lots of columns vs single-column selections.
For example
In [5]: df = DataFrame(np.random.randn(16,2),
                columns=['A','B'],
                index=MultiIndex.from_tuples(
                    [ (i,j) for i in range(4) for j in date_range(
                        '20130101 00:00:00',periods=4,freq='10T') ],
                    names=['id','date']))
In [6]: df
Out[6]:
                               A         B
id date
0  2013-01-01 00:00:00 -0.247945  0.954260
   2013-01-01 00:10:00  1.035678 -0.657710
   2013-01-01 00:20:00 -2.399376 -0.188057
   2013-01-01 00:30:00 -1.043764  0.510098
1  2013-01-01 00:00:00 -0.009998  0.239947
   2013-01-01 00:10:00  2.038563  0.640080
   2013-01-01 00:20:00  1.123922 -0.944170
   2013-01-01 00:30:00 -1.757766 -1.398392
2  2013-01-01 00:00:00 -1.053324 -1.015211
   2013-01-01 00:10:00  0.062408 -1.476484
   2013-01-01 00:20:00 -1.202875 -0.747429
   2013-01-01 00:30:00 -0.798126 -0.485392
3  2013-01-01 00:00:00  0.496098  0.700073
   2013-01-01 00:10:00 -0.042914  1.099115
   2013-01-01 00:20:00 -1.762597 -0.239100
   2013-01-01 00:30:00 -0.344125 -1.607524
[16 rows x 2 columns]
In 0.12, use table=True rather than format='table'.
In [7]: df.to_hdf('test.h5','df',mode='w',format='table')
In [8]: store = pd.HDFStore('test.h5')
In [9]: store
Out[9]:
<class 'pandas.io.pytables.HDFStore'>
File path: test.h5
/df frame_table (typ->appendable_multi,nrows->16,ncols->4,indexers->[index],dc->[date,id])
In [10]: store.select('df',where='id=0')
Out[10]:
                               A         B
id date
0  2013-01-01 00:00:00 -0.247945  0.954260
   2013-01-01 00:10:00  1.035678 -0.657710
   2013-01-01 00:20:00 -2.399376 -0.188057
   2013-01-01 00:30:00 -1.043764  0.510098
[4 rows x 2 columns]
This is 0.13 syntax (it is a bit more tricky in 0.12):
In [18]: store.select('df',where='date>"20130101 00:10:00" & date<"20130101 00:30:00"')
Out[18]:
                               A         B
id date
0  2013-01-01 00:20:00 -2.399376 -0.188057
1  2013-01-01 00:20:00  1.123922 -0.944170
2  2013-01-01 00:20:00 -1.202875 -0.747429
3  2013-01-01 00:20:00 -1.762597 -0.239100
[4 rows x 2 columns]
So, for example, to do a groupby on the id, you can select all of the unique ids (using the select_column method), then iterate over these, doing a query and performing your function on the results. This will be quite fast, as these are indexed columns. Something like this:
In [24]: ids = store.select_column('df','id').unique()
In [25]: ids
Out[25]: array([0, 1, 2, 3])
In [27]: pd.concat([ store.select('df',where='id={0}'.format(i)).sum() for i in ids ],axis=1)
Out[27]:
          0         1         2         3
A -2.655407  1.394721 -2.991917 -1.653539
B  0.618590 -1.462535 -3.724516 -0.047436
[2 rows x 4 columns]
In [28]: store.close()
A multi-groupby is just a combination query, e.g. id=1 & date>='20130101 00:10:00' & date<='20130101 00:30:00'.
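For instance, a sketch of such a windowed selection against the store built above (using the 0.13-style where syntax):

import pandas as pd

store = pd.HDFStore('test.h5')
subset = store.select(
    'df',
    where="id=1 & date>='20130101 00:10:00' & date<='20130101 00:30:00'")
store.close()

# aggregate the windowed selection however you like, e.g.
subset.mean()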
You might find this example here instructive as well.
Upvotes: 1