Reputation: 2905
Suppose I have a pandas Series object, and I want to take all elements (meaning indices) whose corresponding values obey some condition.
There are many possible ways to do it, but I'd expect there to be a simple, efficient, idiomatic way - which I haven't found.
This question describes how to do it with boolean indexing, but this seems overly verbose for a simple command - for example:
import pandas as pd
age = pd.Series(index=['mom','dad','cat1','cat2','baby'],
data=[30,30,3,3,1])
age[age>10].index.values
[EDITED: Do note that the variable name age appears twice in the previous line. Of course age[age>10] is very short, but this is just because age is a short name. If I'm dealing with a series that has a long name, for example age_of_family_members_after_filtering, then age_of_family_members_after_filtering[age_of_family_members_after_filtering>10] wouldn't look so good.]
The other solutions that I found are similarly verbose:
age.where(lambda x: x>10).dropna().index.values
or:
[name for name, _age in age.items() if _age>10]
(the last one returns a list while the previous ones return arrays, but both are okay with me)
Since it's a very common command, I'd expect something like age.filter_where(lambda x: x>10) or similar, and I'm surprised not to find one.
What am I missing (if at all)? Thanks in advance.
Upvotes: 1
Views: 86
Reputation: 1919
Row slicing in pandas accepts a callable, so you can do:
age.loc[lambda x: x > 10]
It looks like a bit too much for this small example where the series is called age, but with a name like series_long_after_operation this becomes a lot clearer:
age.loc[lambda x: x > 10].loc[lambda x: x%2==0]
The second one is really the way to go for long piping operations where every method returns a dataframe of a different shape.
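As a minimal sketch of the idea above (the long variable name is borrowed from the question purely for illustration), the callable form lets the name appear only once and still gives you the index labels:

```python
import pandas as pd

age_of_family_members_after_filtering = pd.Series(
    index=['mom', 'dad', 'cat1', 'cat2', 'baby'],
    data=[30, 30, 3, 3, 1],
)

# The callable receives the Series itself, so the long name is written once.
adults = age_of_family_members_after_filtering.loc[lambda s: s > 10]
adult_names = adults.index.tolist()

# Chained filters: each callable sees the result of the previous step.
even_adults = (
    age_of_family_members_after_filtering
    .loc[lambda s: s > 10]
    .loc[lambda s: s % 2 == 0]
)
```

Each callable only ever sees the intermediate result, which is why this style works well in longer method chains.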
Upvotes: 2
Reputation: 2905
In my searches, I've found that for a DataFrame (no equivalent for Series yet) you can avoid the double reference to the variable name (which, again, might be a readability issue, depending on the coding conventions where you're at) by using .query (which is probably much worse than the accepted solution performance-wise, but still worth noting):
import pandas as pd
df = pd.DataFrame(index=['mom','dad','cat1','cat2','baby'],
                  data=[30,30,3,3,1],
                  columns=['age'])
df.query('age>10')
results in
      age
mom    30
dad    30
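If you start from a Series, one possible workaround is to go through a temporary one-column DataFrame. This is only a sketch: the column name 'age' passed to to_frame is chosen for illustration so that query has something to refer to.

```python
import pandas as pd

age = pd.Series(index=['mom', 'dad', 'cat1', 'cat2', 'baby'],
                data=[30, 30, 3, 3, 1])

# to_frame('age') names the single column so the query string can use it.
names = age.to_frame('age').query('age > 10').index.tolist()
```

The Series name is still written only once, at the cost of building an intermediate DataFrame.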
Upvotes: 1
Reputation: 1654
For the given solutions, you can use the Jupyter %timeit magic command to compare them:
# %%
%timeit age[age>10].index.values
--> 235 µs ± 8.68 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# %%
%timeit age.where(lambda x: x>10).dropna().index.values
--> 510 µs ± 14.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# %%
%timeit [name for name, _age in age.items() if _age>10]
--> 12.5 µs ± 429 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
This means that, of the given solutions, the last one is the fastest, but the first one is the simplest and still perfectly valid.
Another option; note the efficiency difference between the two variants:
age.index[age.values > 10].tolist()
--> 16.5 µs ± 823 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
age.index[age > 10].tolist()
--> 157 µs ± 12.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
UPDATE with @Alexander's idea:
# %%
from itertools import compress
%timeit list(compress(age.index, age > 10))
--> 119 µs ± 3.24 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
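If you are not working in Jupyter, a rough equivalent of the comparison above can be sketched with the standard timeit module. The labels in the dictionary are just names I made up for readability, and the absolute numbers will differ by machine and pandas version:

```python
import timeit
from itertools import compress

import pandas as pd

age = pd.Series(index=['mom', 'dad', 'cat1', 'cat2', 'baby'],
                data=[30, 30, 3, 3, 1])

# Candidate expressions from this thread, keyed by an illustrative label.
candidates = {
    'boolean indexing': lambda: age[age > 10].index.tolist(),
    'index with numpy mask': lambda: age.index[age.values > 10].tolist(),
    'list comprehension': lambda: [name for name, _age in age.items() if _age > 10],
    'compress': lambda: list(compress(age.index, age > 10)),
}

# Run each candidate once to confirm they all select the same labels.
results = {label: func() for label, func in candidates.items()}

# Time each candidate and report the average time per call.
for label, func in candidates.items():
    seconds = timeit.timeit(func, number=1000)
    print(f'{label}: {seconds / 1000 * 1e6:.1f} µs per call')
```

Keep in mind that a five-element series mostly measures fixed overhead, so the ranking may change on larger data.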
Upvotes: 2
Reputation: 109756
You could compress the index, but I don't believe it is any easier than simple boolean indexing, which is quite concise IMO.
from itertools import compress
>>> list(compress(age.index, age > 10))
['mom', 'dad']
Upvotes: 2