mgilbert

Reputation: 3665

Return a slice of an in-kernel query

I am trying to return a slice of an in-kernel PyTables query without first returning the entire matching range and then taking [-1], since the query result is quite large. As an example, I have data of the form

import tables

class Tick(tables.IsDescription):
    timestamp = tables.Int64Col()
    bid = tables.Float64Col()
    ask = tables.Float64Col()

h5file = tables.open_file('test.h5', 'w')
tbl = h5file.create_table('/', 'ticks', Tick)

rows = [(123, 1.34, 1.35),(127, 1.345, 1.355),(128, 1.35, 1.36)]
tick = tbl.row
for row in rows:
    tick['bid'] = row[1]
    tick['ask'] = row[2]
    tick['timestamp'] = row[0]
    tick.append()
tbl.flush()
h5file.close()

I would like to do something of the form

tbl.read_where('tail 1 (timestamp <= 127)')

which would have the same effect as

tbl.read_where('(timestamp <= 127)')[-1]

but be more efficient. I've looked at using the start/stop parameters, but those slice the table before the condition is applied, whereas I need the slice applied after the condition.
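One workaround while PyTables has no "tail" qualifier: since the timestamps in this data are monotonically non-decreasing, the last row satisfying `timestamp <= t` can be found with a binary search on the timestamp column instead of materializing every matching row. A minimal pure-Python sketch of the idea, where the in-memory `timestamps`/`rows` lists stand in for the table's sorted timestamp column:

```python
from bisect import bisect_right

# Stand-ins for the table contents created above; the timestamp column
# is assumed to be sorted in ascending order.
timestamps = [123, 127, 128]
rows = [(123, 1.34, 1.35), (127, 1.345, 1.355), (128, 1.35, 1.36)]

def last_row_at_or_before(ts):
    """Return the last row with timestamp <= ts, or None if none exists."""
    i = bisect_right(timestamps, ts)  # first index whose timestamp is > ts
    return rows[i - 1] if i > 0 else None

last_row_at_or_before(127)  # -> (127, 1.345, 1.355)
```

The same binary-search idea applies to the on-disk table by reading only the timestamp column (or a chunk of it) rather than evaluating the condition over every row.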

The exact data is of the form

09/05/14 20:59:41,1.37580,1.37620
09/05/14 20:59:43,1.37584,1.37624
09/05/14 20:59:45,1.37580,1.37620
09/05/14 20:59:45,1.37578,1.37622
09/05/14 20:59:45,1.37574,1.37624
09/05/14 20:59:58,1.37574,1.37624
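For reference, a hedged sketch of converting those text rows into values matching the Tick description above. The day-first '%d/%m/%y' format and the epoch-seconds encoding are assumptions, chosen because Int64Col needs an integer timestamp:

```python
from datetime import datetime

def parse_tick(line):
    """Parse 'DD/MM/YY HH:MM:SS,bid,ask' into (epoch_seconds, bid, ask)."""
    ts_str, bid, ask = line.split(',')
    ts = datetime.strptime(ts_str, '%d/%m/%y %H:%M:%S')
    # Seconds since the Unix epoch, computed against a fixed UTC reference
    # so the result does not depend on the local timezone.
    epoch = int((ts - datetime(1970, 1, 1)).total_seconds())
    return epoch, float(bid), float(ask)

parse_tick('09/05/14 20:59:41,1.37580,1.37620')
```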

Upvotes: 1

Views: 97

Answers (1)

Jeff

Reputation: 129018

Using Pytables 3.0.0 (and pandas 0.14.1, which interfaces with PyTables in a higher level manner; the generated tables are accessible from either).

In [1]: pd.set_option('max_rows',10)

In [2]: N = 100000000

In [3]: df = DataFrame(dict(A = np.random.randn(N), B = np.random.randn(N)), index=date_range('20130101',freq='ms',periods=N))

In [4]: df
Out[4]: 
                                   A         B
2013-01-01 00:00:00        -1.184339 -0.362050
2013-01-01 00:00:00.001000 -0.431403 -0.602782
2013-01-01 00:00:00.002000  0.582003  1.207553
2013-01-01 00:00:00.003000  0.208940 -0.507944
2013-01-01 00:00:00.004000 -1.402088 -0.502517
...                              ...       ...
2013-01-02 03:46:39.995000  1.815447 -0.050623
2013-01-02 03:46:39.996000  0.071673  1.138665
2013-01-02 03:46:39.997000 -0.778820 -0.280813
2013-01-02 03:46:39.998000  0.920727  0.570497
2013-01-02 03:46:39.999000 -1.205459  0.437231

[100000000 rows x 2 columns]

In [5]: df.to_hdf('test.hdf','df', mode='w', format='table', complib='blosc')

In [6]: pd.read_hdf('test.hdf','df',where='(index>"2013-01-02 01:00:00") and (index<"2013-01-02 01:00:01")')
Out[6]: 
                                   A         B
2013-01-02 01:00:00.001000 -0.210051 -0.866118
2013-01-02 01:00:00.002000 -1.164465  0.388854
2013-01-02 01:00:00.003000  1.110326  0.925144
2013-01-02 01:00:00.004000  0.565132 -0.291035
2013-01-02 01:00:00.005000 -1.026886  0.047159
...                              ...       ...
2013-01-02 01:00:00.995000  0.280094 -1.080868
2013-01-02 01:00:00.996000 -1.394722 -0.523851
2013-01-02 01:00:00.997000  0.072997 -0.643343
2013-01-02 01:00:00.998000  0.721472  0.447951
2013-01-02 01:00:00.999000 -0.838169 -0.794621

[999 rows x 2 columns]

In [8]: %timeit pd.read_hdf('test.hdf','df',where='(index>"2013-01-02 01:00:00") and (index<"2013-01-02 01:00:01")').iloc[-1]
10 loops, best of 3: 31.6 ms per loop

My suggestion is to select a typical range somewhat bigger than you need. If it doesn't contain any values, select a bigger one (as you stated that your time series can be slightly irregular). In any event, indexing is the key (pandas will automatically create the required index on 'index'), so selection becomes quite efficient. The queries can be specified in a relatively high-level way; see the pandas HDF5 querying documentation.
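The widening-window retry described above can be sketched generically. Here `query` is a hypothetical callable standing in for a `pd.read_hdf(..., where=...)` call, and the doubling schedule and cutoff are assumptions:

```python
from datetime import datetime, timedelta

def last_row_before(query, t, initial=timedelta(seconds=1),
                    max_window=timedelta(hours=1)):
    """Return the last row in (t - window, t], doubling the window until
    something matches or max_window is exceeded.

    `query(start, end)` is a hypothetical callable standing in for e.g. a
    pd.read_hdf call with a where clause on the index; it returns a list
    of (timestamp, ...) rows in ascending time order."""
    window = initial
    while window <= max_window:
        rows = query(t - window, t)
        if rows:
            return rows[-1]  # rows are time-ordered, so the last is newest
        window *= 2          # nothing in range: retry with a bigger window
    return None

# In-memory stand-in for the HDF5 store, mirroring the irregular ticks
# from the question (17-second gap between the last two rows).
data = [(datetime(2014, 5, 9, 20, 59, 41), 1.37580),
        (datetime(2014, 5, 9, 20, 59, 58), 1.37574)]

def query(start, end):
    return [r for r in data if start < r[0] <= end]

last_row_before(query, datetime(2014, 5, 9, 21, 0, 30))
```

Because the window doubles, a gap of any size up to max_window costs only a logarithmic number of queries, each of which stays cheap thanks to the index.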

For reference here is the generated PyTables meta-data

In [5]: !ptdump -av test.hdf
/ (RootGroup) ''
  /._v_attrs (AttributeSet), 4 attributes:
   [CLASS := 'GROUP',
    PYTABLES_FORMAT_VERSION := '2.1',
    TITLE := '',
    VERSION := '1.0']
/df (Group) ''
  /df._v_attrs (AttributeSet), 14 attributes:
   [CLASS := 'GROUP',
    TITLE := '',
    VERSION := '1.0',
    data_columns := [],
    encoding := None,
    index_cols := [(0, 'index')],
    info := {1: {'type': 'Index', 'names': [None]}, 'index': {'freq': <Milli>}},
    levels := 1,
    nan_rep := 'nan',
    non_index_axes := [(1, ['A', 'B'])],
    pandas_type := 'frame_table',
    pandas_version := '0.10.1',
    table_type := 'appendable_frame',
    values_cols := ['values_block_0']]
/df/table (Table(100000000,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(2,), dflt=0.0, pos=1)}
  byteorder := 'little'
  chunkshape := (21845,)
  autoindex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
  /df/table._v_attrs (AttributeSet), 11 attributes:
   [CLASS := 'TABLE',
    FIELD_0_FILL := 0,
    FIELD_0_NAME := 'index',
    FIELD_1_FILL := 0.0,
    FIELD_1_NAME := 'values_block_0',
    NROWS := 100000000,
    TITLE := '',
    VERSION := '2.7',
    index_kind := 'datetime64',
    values_block_0_dtype := 'float64',
    values_block_0_kind := ['A', 'B']]

Upvotes: 1
