THM
THM

Reputation: 805

What is the PyTables counterpart of a SQL query "SELECT col2 FROM table WHERE col1 IN (val1, val2, val3...)"?

I'm looking for the PyTables counterpart of a SQL query in the form of...

SELECT col2 FROM table WHERE col1 IN (val1, val2, val3...)

...where the condition values are stored in a Python list/tuple or numpy array of arbitrary length. The search values are determined during runtime of the application by the user.

I've tried an approach using PyTables Conditions related with in-kernel search:

options = {
            'val' : 'someval1'
            'val' : 'someval2'
           }
for row in table.where('col1 == val', options):
    print row['col2']

but it's only picking the first value.

My main goal is to build the options dictionary or list dynamically and pass it to Pytables.

Upvotes: 1

Views: 518

Answers (2)

Francesc
Francesc

Reputation: 376

I think Anthony had a typo in the syntax for query. This should work:

In [1]: import numpy as np
In [2]: import tables as tb
In [3]: ra = np.fromiter(((i, i*2, i*3) for i in xrange(1000000)), dtype='i4,f4,f8')
In [4]: f = tb.open_file("mytable.h5", "w")
In [5]: t = f.create_table(f.root, "mytable", ra)
In [6]: cond = "|".join(["(f1 == %d)"%i for i in (2, 4, 10)])
In [7]: %timeit [r['f0'] for r in t.where(cond)]
100 loops, best of 3: 9.14 ms per loop

This speed is very close to my best attempt in numpy:

In [8]: %timeit np.where((ra['f1'] == 2)|(ra['f1'] == 4)|(ra['f1'] == 10))[0]
100 loops, best of 3: 8.32 ms per loop

And also, if you want to make use of the 'in' operator:

In [9]: %timeit [r['f0'] for r in t if r['f1'] in (2, 4, 10)]
10 loops, best of 3: 160 ms per loop

but in this case speed sucks.

Upvotes: 0

Anthony Scopatz
Anthony Scopatz

Reputation: 3637

This is because your options dictionary only has one element in it. The right way to do this is something like:

for row in table.where('col1 == val || col1 == val2'):
    print row['col2']

Upvotes: 1

Related Questions