Reputation: 96360
Say, for the sake of an example, that I have several columns encoding different types of rates ("annual rate", "1/2 annual rate", etc.). I want to use query on my dataframe to find entries where any of these rates is above 1.
First I find the columns that I want to use in my query:
cols = [x for x in df.columns if 'rate' in x]
where, say, cols contains:
["annual rate", "1/2 annual rate", "monthly rate"]
I then want to do something like:
df.query('any of my cols > 1')
How can I format this for query?
Upvotes: 2
Views: 820
Reputation: 25672
query performs a full parse of a Python expression (with some limits, e.g., you can't use lambda expressions or ternary if/else expressions). This means that any column that you refer to in your query string must be a valid Python identifier (a more formal word for "variable name"). One way to check this is to use the Name pattern lurking in the tokenize module:
In [155]: import re, tokenize
In [156]: tokenize.Name
Out[156]: '[a-zA-Z_]\\w*'
In [157]: def isidentifier(x):
.....:     return re.match(tokenize.Name + '$', x) is not None
.....:
In [158]: isidentifier('adsf')
Out[158]: True
In [159]: isidentifier('1adsf')
Out[159]: False
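On Python 3 the regex-based check above may not behave as shown, since tokenize.Name is defined differently there (r'\w+' in the versions I'm aware of). A minimal sketch of the same check using the built-in str.isidentifier and the keyword module; the helper name is_valid_query_name is made up for illustration:

```python
import keyword


def is_valid_query_name(name):
    """Return True if `name` is a plain Python identifier and not a keyword."""
    # str.isidentifier() accepts keywords such as "lambda", so rule those out too.
    return name.isidentifier() and not keyword.iskeyword(name)


cols = ["annual rate", "1/2 annual rate", "monthly rate", "annual_rate"]
for c in cols:
    print(repr(c), is_valid_query_name(c))
```

Only the last name in that list passes; the three original column names all fail because of the spaces (and, for the second, the leading digit and the slash).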
Now, since your column names have spaces, each word separated by spaces will be evaluated as a separate identifier, so you'll have something like
df.query("annual rate > 1")
which is invalid Python syntax. Try typing annual rate into a Python interpreter and you'll get a SyntaxError exception.
Take-home message: rename your columns to be valid variable names. You won't be able to do this programmatically (at least, not easily) unless your columns follow some kind of structure. In your case you could do:
In [166]: cols
Out[166]: ['annual rate', '1/2 annual rate', 'monthly rate']
In [167]: list(map(lambda x: '_'.join(x.split()).replace('1/2', 'half'), cols))
Out[167]: ['annual_rate', 'half_annual_rate', 'monthly_rate']
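If the columns don't follow a neat pattern, a generic fallback is possible, at the cost of uglier names. A rough sketch, assuming it's acceptable to replace every run of non-word characters with an underscore and to prefix names that start with a digit; to_identifier is a hypothetical helper, not part of pandas:

```python
import re


def to_identifier(name):
    """Turn an arbitrary column label into a valid Python identifier (lossy)."""
    # Collapse every run of non-word characters (spaces, '/', ...) into one underscore.
    cleaned = re.sub(r"\W+", "_", name.strip())
    # Identifiers can't start with a digit, so prefix those (and empty results).
    if not cleaned or cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned


cols = ["annual rate", "1/2 annual rate", "monthly rate"]
mapping = {c: to_identifier(c) for c in cols}
print(mapping)
```

The resulting dict can be passed to df.rename(columns=mapping). Keep in mind that this sketch doesn't guard against Python keywords, and that two different labels can collapse to the same identifier, so it's worth checking the mapping for duplicates before relying on it.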
Then you can format the query string similar to @acushner's example:
In [173]: newcols
Out[173]: ['annual_rate', 'half_annual_rate', 'monthly_rate']
In [174]: ' or '.join('%s > 1' % c for c in newcols)
Out[174]: 'annual_rate > 1 or half_annual_rate > 1 or monthly_rate > 1'
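Putting the rename and the query string together, a small end-to-end sketch (the data is made up for illustration; the renamed columns are the ones from the session above):

```python
import pandas as pd

cols = ["annual rate", "1/2 annual rate", "monthly rate"]
newcols = ["annual_rate", "half_annual_rate", "monthly_rate"]

# Made-up data, only to have something to filter.
df = pd.DataFrame(
    [[0.5, 0.2, 1.5],
     [0.1, 0.3, 0.4],
     [2.0, 0.9, 0.1]],
    columns=cols,
)

# Rename to valid identifiers, then OR together one comparison per column.
renamed = df.rename(columns=dict(zip(cols, newcols)))
expr = " or ".join("%s > 1" % c for c in newcols)
result = renamed.query(expr)
print(result)
```

With this data, the rows at index 0 and 2 are kept, since each has at least one rate above 1.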
Note that you don't actually need query here:
In [180]: df = DataFrame(randn(10, 3), columns=cols)
In [181]: df
Out[181]:
   annual rate  1/2 annual rate  monthly rate
0      -0.6980           0.6322        2.5695
1      -0.1413          -0.3285       -0.9856
2       0.8189           0.7166       -1.4302
3       1.3300          -0.9596       -0.8934
4      -1.7545          -0.9635        2.8515
5      -1.1389           0.1055        0.5423
6       0.2788          -1.3973       -0.9073
7      -1.8570           1.3781        0.0501
8      -0.6842          -0.2012       -0.5083
9      -0.3270          -1.5280        0.2251
[10 rows x 3 columns]
In [182]: df.gt(1).any(axis=1)
Out[182]:
0     True
1    False
2    False
3     True
4     True
5    False
6    False
7     True
8    False
9    False
dtype: bool
In [183]: df[df.gt(1).any(axis=1)]
Out[183]:
   annual rate  1/2 annual rate  monthly rate
0      -0.6980           0.6322        2.5695
3       1.3300          -0.9596       -0.8934
4      -1.7545          -0.9635        2.8515
7      -1.8570           1.3781        0.0501
[4 rows x 3 columns]
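One caveat: df.gt(1) compares every column in the frame, so if the frame also holds non-rate columns you'd probably want to select the rate columns first. A minimal sketch with made-up data (the "name" column is only there to stand in for a non-rate column):

```python
import pandas as pd

# Made-up frame with one non-rate column, for illustration only.
df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "annual rate": [0.5, 1.2, 0.3],
    "monthly rate": [0.1, 0.2, 0.4],
})

cols = [c for c in df.columns if "rate" in c]

# Compare only the rate columns, then keep rows where any comparison is True.
mask = df[cols].gt(1).any(axis=1)
print(df[mask])
```

This keeps the original column names untouched, which sidesteps the identifier problem entirely.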
As @Jeff noted in the comments you can refer to non-identifier column names, albeit in a clunky way:
pd.eval('df[df["annual rate"]>0]')
I wouldn't recommend writing code like this if you want to save the lives of kittens.
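Newer pandas releases also let query refer to such columns by wrapping the name in backticks. A sketch, assuming a reasonably recent pandas (to my knowledge, names with spaces have been supported since 0.25, and names with other characters such as "/" or a leading digit since 1.0); the data is made up:

```python
import pandas as pd

cols = ["annual rate", "1/2 annual rate", "monthly rate"]

# Made-up data, only to have something to filter.
df = pd.DataFrame(
    [[0.5, 0.2, 1.5],
     [0.1, 0.3, 0.4],
     [2.0, 0.9, 0.1]],
    columns=cols,
)

# Backtick-quote each column so names with spaces or "/" can be used as-is.
expr = " or ".join("`%s` > 1" % c for c in cols)
result = df.query(expr)
print(result)
```

If your pandas version predates backtick support, renaming the columns or using the boolean mask remains the safer route.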
Upvotes: 5
Reputation: 9946
Something like this should do the trick:
df.query('|'.join('(%s > 1)' % col for col in cols))
I'm not sure how to deal with spaces in column names, though, so you might have to rename them.
Upvotes: 1