Reputation: 3665
I am trying to return a slice of an in-kernal pytables query without first returning the entire range and then taking [-1] since the size of the query is quite large. As an example I have data of the form
import tables
class Tick(tables.IsDescription):
timestamp = tables.Int64Col()
bid = tables.Float64Col()
ask = tables.Float64Col()
h5file = tables.openFile('test.h5','w')
tbl = h5file.createTable('/', 'ticks', Tick)
rows = [(123, 1.34, 1.35),(127, 1.345, 1.355),(128, 1.35, 1.36)]
tick = tbl.row
for row in rows:
tick['bid'] = row[1]
tick['ask'] = row[2]
tick['timestamp'] = row[0]
tick.append()
tbl.flush()
h5file.close()
I would like to do something of the form
tbl.readWhere('tail 1 (timestamp <= 127)')
which would have the same effect as
tbl.readWhere('(timestamp <= 127)')[-1]
but be more effecient. I've looked at using the start/stop parameters but this does does the slice pre-conditional statement, whereas I need it post conditional.
The exact data is of the form
09/05/14 20:59:41,1.37580,1.37620
09/05/14 20:59:43,1.37584,1.37624
09/05/14 20:59:45,1.37580,1.37620
09/05/14 20:59:45,1.37578,1.37622
09/05/14 20:59:45,1.37574,1.37624
09/05/14 20:59:58,1.37574,1.37624
Upvotes: 1
Views: 97
Reputation: 129018
Using Pytables 3.0.0 (and pandas 0.14.1, which interfaces with PyTables in a higher level manner; the generated tables are accessible from either).
In [1]: pd.set_option('max_rows',10)
In [2]: N = 100000000
In [3]: df = DataFrame(dict(A = np.random.randn(N), B = np.random.randn(N)), index=date_range('20130101',freq='ms',periods=N))
df
In [4]: df
Out[4]:
A B
2013-01-01 00:00:00 -1.184339 -0.362050
2013-01-01 00:00:00.001000 -0.431403 -0.602782
2013-01-01 00:00:00.002000 0.582003 1.207553
2013-01-01 00:00:00.003000 0.208940 -0.507944
2013-01-01 00:00:00.004000 -1.402088 -0.502517
... ... ...
2013-01-02 03:46:39.995000 1.815447 -0.050623
2013-01-02 03:46:39.996000 0.071673 1.138665
2013-01-02 03:46:39.997000 -0.778820 -0.280813
2013-01-02 03:46:39.998000 0.920727 0.570497
2013-01-02 03:46:39.999000 -1.205459 0.437231
[100000000 rows x 2 columns]
In [5]: df.to_hdf('test.hdf','df', mode='w',format='table',compress='blosc')
In [6]: pd.read_hdf('test.hdf','df',where='(index>"2013-01-02 01:00:00") and (index<"2013-01-02 01:00:01")')
Out[6]:
A B
2013-01-02 01:00:00.001000 -0.210051 -0.866118
2013-01-02 01:00:00.002000 -1.164465 0.388854
2013-01-02 01:00:00.003000 1.110326 0.925144
2013-01-02 01:00:00.004000 0.565132 -0.291035
2013-01-02 01:00:00.005000 -1.026886 0.047159
... ... ...
2013-01-02 01:00:00.995000 0.280094 -1.080868
2013-01-02 01:00:00.996000 -1.394722 -0.523851
2013-01-02 01:00:00.997000 0.072997 -0.643343
2013-01-02 01:00:00.998000 0.721472 0.447951
2013-01-02 01:00:00.999000 -0.838169 -0.794621
[999 rows x 2 columns]
In [8]: %timeit pd.read_hdf('test.hdf','df',where='(index>"2013-01-02 01:00:00") and (index<"2013-01-02 01:00:01")').iloc[-1]
10 loops, best of 3: 31.6 ms per loop
My suggestion was to select a typical range bigger than you need. If it doesn't have any values then select a bigger one (as you stated that your timeseries could be slightly irregular). In any event, indexing is the key (pandas automatically will create the required indexes on the 'index'). Selection becomes quite efficient. The queries can be specified in a relatively high level way, see here
For reference here is the generated PyTables meta-data
In [5]: !ptdump -av test.hdf
/ (RootGroup) ''
/._v_attrs (AttributeSet), 4 attributes:
[CLASS := 'GROUP',
PYTABLES_FORMAT_VERSION := '2.1',
TITLE := '',
VERSION := '1.0']
/df (Group) ''
/df._v_attrs (AttributeSet), 14 attributes:
[CLASS := 'GROUP',
TITLE := '',
VERSION := '1.0',
data_columns := [],
encoding := None,
index_cols := [(0, 'index')],
info := {1: {'type': 'Index', 'names': [None]}, 'index': {'freq': <Milli>}},
levels := 1,
nan_rep := 'nan',
non_index_axes := [(1, ['A', 'B'])],
pandas_type := 'frame_table',
pandas_version := '0.10.1',
table_type := 'appendable_frame',
values_cols := ['values_block_0']]
/df/table (Table(100000000,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Float64Col(shape=(2,), dflt=0.0, pos=1)}
byteorder := 'little'
chunkshape := (21845,)
autoindex := True
colindexes := {
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
/df/table._v_attrs (AttributeSet), 11 attributes:
[CLASS := 'TABLE',
FIELD_0_FILL := 0,
FIELD_0_NAME := 'index',
FIELD_1_FILL := 0.0,
FIELD_1_NAME := 'values_block_0',
NROWS := 100000000,
TITLE := '',
VERSION := '2.7',
index_kind := 'datetime64',
values_block_0_dtype := 'float64',
values_block_0_kind := ['A', 'B']]
Upvotes: 1