Amelio Vazquez-Reina

Reputation: 96360

Using dynamic lists in query in Pandas

Say, for the sake of an example, that I have several columns encoding different types of rates ("annual rate", "1/2 annual rate", etc.). I want to use query on my dataframe to find entries where any of these rates is above 1.

First I find the columns that I want to use in my query:

cols = [c for c in df.columns if 'rate' in c]

where, say, cols contains:

["annual rate", "1/2 annual rate", "monthly rate"]

I then want to do something like:

df.query('any of my cols > 1')

How can I format this for query?

Upvotes: 2

Views: 820

Answers (2)

Phillip Cloud

Reputation: 25672

query performs a full parse of a Python expression (with some limits, e.g., you can't use lambda expressions or ternary if/else expressions). This means that every column you refer to in your query string must be a valid Python identifier (a more formal word for "variable name"). One way to check this is to use the Name pattern lurking in the tokenize module:

In [156]: tokenize.Name
Out[156]: '[a-zA-Z_]\\w*'

In [157]: def isidentifier(x):
   .....:     return re.fullmatch(tokenize.Name, x) is not None
   .....:

In [158]: isidentifier('adsf')
Out[158]: True

In [159]: isidentifier('1adsf')
Out[159]: False
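If you are on Python 3, a shorter way to run the same check is the built-in str.isidentifier method. The sketch below is only an illustration: is_valid_name is a made-up helper name, and it additionally rejects reserved words such as class, which the Name pattern alone would accept.

```python
import keyword

def is_valid_name(name):
    # str.isidentifier() tests the whole string, so names with spaces fail.
    # Keywords like 'class' pass isidentifier() but cannot be used as names.
    return name.isidentifier() and not keyword.iskeyword(name)

for candidate in ['annual_rate', 'annual rate', '1adsf', 'class']:
    print(candidate, is_valid_name(candidate))
```

Only the first of the four candidates passes the check.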

Now, since your column names have spaces, each space-separated word will be evaluated as a separate identifier, so you'll have something like

df.query("annual rate > 1")

which is invalid Python syntax. Try typing annual rate into a Python interpreter and you'll get a SyntaxError exception.

Take-home message: rename your columns to be valid variable names. You won't be able to do this programmatically (at least, not easily) unless your columns follow some kind of structure. In your case you could do

In [166]: cols
Out[166]: ['annual rate', '1/2 annual rate', 'monthly rate']

In [167]: list(map(lambda x: '_'.join(x.split()).replace('1/2', 'half'), cols))
Out[167]: ['annual_rate', 'half_annual_rate', 'monthly_rate']
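If the names are less regular than these, a slightly more general cleanup might look like the sketch below. The to_identifier helper and its replacements argument are hypothetical names made up for this example; the rules it applies (collapse non-word characters into underscores, prefix a leading digit) are one reasonable choice, not the only one.

```python
import re

def to_identifier(name, replacements=None):
    # Apply any hand-picked substitutions first, e.g. '1/2' -> 'half'.
    for old, new in (replacements or {}).items():
        name = name.replace(old, new)
    # Collapse each run of non-word characters into a single underscore.
    name = re.sub(r'\W+', '_', name).strip('_')
    # Identifiers cannot start with a digit, so prefix an underscore.
    if name and name[0].isdigit():
        name = '_' + name
    return name

cols = ['annual rate', '1/2 annual rate', 'monthly rate']
newcols = [to_identifier(c, {'1/2': 'half'}) for c in cols]
print(newcols)
```

The renamed list can then be applied with something like df.rename(columns=dict(zip(cols, newcols))).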

Then, with the renamed columns stored in newcols, you can format the query string similar to @acushner's example

In [173]: newcols
Out[173]: ['annual_rate', 'half_annual_rate', 'monthly_rate']

In [174]: ' or '.join('%s > 1' % c for c in newcols)
Out[174]: 'annual_rate > 1 or half_annual_rate > 1 or monthly_rate > 1'

Note: You don't actually need to use query here:

In [180]: df = DataFrame(randn(10, 3), columns=cols)

In [181]: df
Out[181]:
   annual rate  1/2 annual rate  monthly rate
0      -0.6980           0.6322        2.5695
1      -0.1413          -0.3285       -0.9856
2       0.8189           0.7166       -1.4302
3       1.3300          -0.9596       -0.8934
4      -1.7545          -0.9635        2.8515
5      -1.1389           0.1055        0.5423
6       0.2788          -1.3973       -0.9073
7      -1.8570           1.3781        0.0501
8      -0.6842          -0.2012       -0.5083
9      -0.3270          -1.5280        0.2251

[10 rows x 3 columns]

In [182]: df.gt(1).any(axis=1)
Out[182]:
0     True
1    False
2    False
3     True
4     True
5    False
6    False
7     True
8    False
9    False
dtype: bool

In [183]: df[df.gt(1).any(axis=1)]
Out[183]:
   annual rate  1/2 annual rate  monthly rate
0      -0.6980           0.6322        2.5695
3       1.3300          -0.9596       -0.8934
4      -1.7545          -0.9635        2.8515
7      -1.8570           1.3781        0.0501

[4 rows x 3 columns]
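The frame above contains only rate columns. If yours also has other columns (the name column below is an invented example), comparing the whole frame with 1 would likely fail or give meaningless results, so a sketch that restricts the comparison to the rate columns could look like this:

```python
import pandas as pd

# Toy data: 'name' is a made-up non-rate column for illustration.
df = pd.DataFrame({
    'name': ['a', 'b', 'c'],
    'annual rate': [0.5, 1.5, 0.2],
    '1/2 annual rate': [0.1, 0.3, 0.4],
    'monthly rate': [2.0, 0.0, 0.9],
})

# Pick the rate columns dynamically, then test only those.
cols = [c for c in df.columns if 'rate' in c]
mask = df[cols].gt(1).any(axis=1)

# Keep the rows where at least one rate exceeds 1.
result = df[mask]
print(result)
```

This works with the original column names, spaces and all, because nothing is parsed as an expression.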

As @Jeff noted in the comments you can refer to non-identifier column names, albeit in a clunky way:

pd.eval('df[df["annual rate"]>0]')

I wouldn't recommend writing code like this if you want to save the lives of kittens.
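Newer pandas releases offer a less clunky route: query accepts column names wrapped in backticks. As far as I know this arrived in pandas 0.25, and names containing characters such as / need 1.0 or later, so treat the sketch below as an assumption about your installed version. The data is invented for illustration.

```python
import pandas as pd

# Toy data; the values are made up.
df = pd.DataFrame({
    'annual rate': [0.5, 1.5, 0.2],
    '1/2 annual rate': [0.1, 0.3, 0.4],
    'monthly rate': [2.0, 0.0, 0.9],
})

cols = [c for c in df.columns if 'rate' in c]

# Wrap each name in backticks so query accepts spaces and '/'.
expr = ' or '.join('`%s` > 1' % c for c in cols)
print(expr)

result = df.query(expr)
print(result)
```

With this, the dynamic list of columns can be used in query without renaming anything.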

Upvotes: 5

acushner

Reputation: 9946

Something like this should do the trick:

df.query('|'.join('(%s > 1)' % col for col in cols))

I'm not sure how to deal with spaces in column names though, so you might have to rename them.

Upvotes: 1
