Itamar Mushkin

Reputation: 2905

How to (efficiently, idiomatically) take elements from pandas series based on condition

Suppose I have a pandas Series object, and I want to take all elements (meaning indices) whose corresponding values obey some condition.

There are many possible ways to do it, but I'd expect there to be a simple, efficient, idiomatic way - which I haven't found.

This question describes how to do it with boolean indexing, but this seems overly verbose for a simple command - for example:

import pandas as pd

age = pd.Series(index=['mom','dad','cat1','cat2','baby'],
                data=[30,30,3,3,1])

age[age>10].index.values

[EDITED: Do note that the variable name age appears twice in the previous line. Of course age[age>10] is very short, but that is only because age is a short name. If I'm working with a series that has a long name, for example age_of_family_members_after_filtering, then age_of_family_members_after_filtering[age_of_family_members_after_filtering>10] wouldn't look so good.]

The other solutions that I found are similarly verbose:

age.where(lambda x: x>10).dropna().index.values

or:

[name for name, _age in age.items() if _age>10]

(the last one returns a list while the previous ones return arrays, but both are okay with me)

Since it's a very common operation, I'd expect something like age.filter_where(lambda x: x>10), and I'm surprised not to find one.

What am I missing (if at all)? Thanks in advance.

Upvotes: 1

Views: 86

Answers (4)

DeanLa

Reputation: 1919

Row slicing in pandas accepts a callable, so you can do:

age.loc[lambda x: x > 10]

It looks like overkill for this small example, but:

  • if the series name is not age but series_long_after_operation, this becomes a lot clearer
  • it supports method chaining like age.loc[lambda x: x > 10].loc[lambda x: x%2==0]

The second point is really the way to go for long method chains in which every step returns a DataFrame of a different shape.
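As a minimal sketch of the two points above, here is the same idea applied to the long series name from the question, followed by the chained form. The variable names adults, adult_names and even_adults are only illustrative.

```python
import pandas as pd

age_of_family_members_after_filtering = pd.Series(
    index=['mom', 'dad', 'cat1', 'cat2', 'baby'],
    data=[30, 30, 3, 3, 1],
)

# The callable receives the Series itself, so the long name is written only once.
adults = age_of_family_members_after_filtering.loc[lambda s: s > 10]

# .index holds the labels of the surviving rows; .tolist() makes a plain list.
adult_names = adults.index.tolist()

# Chained form: each callable sees the result of the previous step.
even_adults = (
    age_of_family_members_after_filtering
    .loc[lambda s: s > 10]
    .loc[lambda s: s % 2 == 0]
)
```

The callable is evaluated against whatever object .loc is called on, which is why it keeps working in the middle of a chain where no intermediate variable exists.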

Upvotes: 2

Itamar Mushkin

Reputation: 2905

In my searches, I've found that for a DataFrame (there is no equivalent for a Series yet) you can avoid repeating the variable name (which, again, might be a readability issue, depending on the coding conventions where you work) by using .query. It is probably much worse than the accepted solution performance-wise, but still worth noting:

import pandas as pd

# columns must be a list-like; passing the bare string 'age' raises a TypeError
df = pd.DataFrame(index=['mom','dad','cat1','cat2','baby'],
                  data=[30,30,3,3,1],
                  columns=['age'])

df.query('age>10')

results in

    age
mom 30
dad 30
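If the data starts out as a Series, one possible workaround (a sketch, not a dedicated Series API) is to wrap it in a one-column DataFrame with to_frame and then query it. The column name 'age' and the variable names_over_10 are choices made for this example.

```python
import pandas as pd

age = pd.Series(index=['mom', 'dad', 'cat1', 'cat2', 'baby'],
                data=[30, 30, 3, 3, 1])

# to_frame() wraps the Series in a single-column DataFrame;
# the name given here is what the query string refers to.
names_over_10 = (
    age.to_frame(name='age')
       .query('age > 10')
       .index
       .tolist()
)
```

This keeps the series name to a single mention, at the cost of an extra conversion and a condition written as a string.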

Upvotes: 1

Albo

Reputation: 1654

You can compare the given solutions with Jupyter's %timeit magic command:

# %%
%timeit age[age>10].index.values
--> 235 µs ± 8.68 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


# %%
%timeit age.where(lambda x: x>10).dropna().index.values
--> 510 µs ± 14.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# %%
%timeit [name for name, _age in age.items() if _age>10]
--> 12.5 µs ± 429 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

So, of the given solutions, the last one is the fastest, but the first one is the simplest and still perfectly valid.

Another option; note the difference in efficiency between the two variants:

age.index[age.values > 10].tolist()
--> 16.5 µs ± 823 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

age.index[age > 10].tolist()
--> 157 µs ± 12.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
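The two variants differ only in where the comparison runs: age.values > 10 compares a plain NumPy array, while age > 10 builds a boolean Series first. A small sketch of the faster variant wrapped in a helper follows; labels_where is a hypothetical name, not a pandas function, and on larger data the timings may look different from the five-element case above.

```python
import pandas as pd

age = pd.Series(index=['mom', 'dad', 'cat1', 'cat2', 'baby'],
                data=[30, 30, 3, 3, 1])


def labels_where(series, predicate):
    """Hypothetical helper: return the index labels whose values satisfy predicate.

    The predicate is applied to the underlying NumPy array, so no
    intermediate boolean Series is created.
    """
    mask = predicate(series.to_numpy())
    return series.index[mask].tolist()


names = labels_where(age, lambda values: values > 10)
```

With this helper the series name is written once, similar to the filter_where call the question wished for.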


UPDATE with @Alexander's idea:

# %% 
from itertools import compress
%timeit list(compress(age.index, age > 10))
--> 119 µs ± 3.24 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Upvotes: 2

Alexander

Reputation: 109756

You could compress the index, but I don't believe it is any easier than simple boolean indexing, which is quite concise IMO.

from itertools import compress

>>> list(compress(age.index, age > 10))
['mom', 'dad']

Upvotes: 2
