LondonRob

Reputation: 78923

How to filter Pandas dataframe using 'in' and 'not in' like in SQL

How can I achieve the equivalents of SQL's IN and NOT IN?

I have a list with the required values. Here's the scenario:

df = pd.DataFrame({'country': ['US', 'UK', 'Germany', 'China']})
countries_to_keep = ['UK', 'China']

# pseudo-code:
df[df['country'] not in countries_to_keep]

My current way of doing this is as follows:

df = pd.DataFrame({'country': ['US', 'UK', 'Germany', 'China']})
df2 = pd.DataFrame({'country': ['UK', 'China'], 'matched': True})

# IN
df.merge(df2, how='inner', on='country')

# NOT IN
not_in = df.merge(df2, how='left', on='country')
not_in = not_in[pd.isnull(not_in['matched'])]

But this seems like a horrible kludge. Can anyone improve on it?

Upvotes: 890

Views: 1353678

Answers (12)

MaxU - stand with Ukraine

Reputation: 210942

An alternative solution that uses the .query() method:

In [5]: df.query("country in @countries_to_keep")
Out[5]:
   country
1       UK
3    China

In [6]: df.query("country not in @countries_to_keep")
Out[6]:
   country
0       US
2  Germany

Upvotes: 174

cottontail

Reputation: 23331

groupby could be used to "split" a dataframe into two

If the aim is to divide a dataframe into two dataframes where one has the countries to keep and the other doesn't, the boolean mask created by the isin call can be used in a groupby call to split the dataframe into two: have and have-not.

df = pd.DataFrame({'country': ['US', 'UK', 'Germany', 'China'], 'value': range(4)})
countries_to_keep = ['UK', 'China']
df1, df2 = [g for _, g in df.groupby(df['country'].isin(countries_to_keep))]

The groups come out sorted by key (False before True), so df1 holds the countries not in the list and df2 the countries to keep:

>>> df1
   country  value
0       US      0
2  Germany      2
>>> df2
   country  value
1       UK      1
3    China      3

eval() could also be used

query(), suggested elsewhere, evaluates a string expression and returns the filtered frame directly. A related method is eval(), which evaluates the same kind of expression but returns the result, so it can be used to create a boolean mask for filtering. That mask can then be combined with other masks, perhaps created elsewhere, for more flexible filtering.

msk = df.eval('country in @countries_to_keep')
to_keep = df[msk]     # in
not_keep = df[~msk]   # not in
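
The mask combines with any other boolean mask via the usual &, |, ~ operators; a minimal sketch using the value column of the df defined above:

other = df['value'] > 1   # another mask, perhaps created elsewhere
df[msk & other]           # 'in' AND the other condition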

One special case where this is useful: if you want to filter a single column using a condition, query() is memory inefficient, because it creates a copy of the filtered frame which then has to be indexed again to extract the single column, whereas loc selects the column in one go using a boolean mask and column label combo. eval() can produce the mask for that.1

import numpy as np

df = pd.DataFrame({'country': ['US', 'UK', 'Germany', 'China']*25000})
df[[f"col{i}" for i in range(50)]] = np.random.rand(100000, 50)
countries_to_keep = ['UK', 'China']

filtered = df.loc[df.eval('country == @countries_to_keep'), 'col1']

1 A memory profiler test:

import numpy as np
import pandas as pd
%load_ext memory_profiler

df = pd.DataFrame({'country': ['US', 'UK', 'Germany', 'China']*25000})
df[[f"col{i}" for i in range(50)]] = np.random.rand(100000, 50)
countries_to_keep = ['UK', 'China']


%memit x = df.loc[df.eval('country==@countries_to_keep'), 'col1']
# peak memory: 157.28 MiB, increment: 5.44 MiB

%memit y = df.query('country==@countries_to_keep')['col1']
# peak memory: 195.39 MiB, increment: 38.11 MiB

%memit z = df.loc[df['country'].isin(countries_to_keep), 'col1']
# peak memory: 176.93 MiB, increment: 0.76 MiB

Upvotes: 1

DSM

Reputation: 353449

You can use pd.Series.isin.

For "IN" use: something.isin(somewhere)

Or for "NOT IN": ~something.isin(somewhere)

As a worked example:

>>> df
    country
0        US
1        UK
2   Germany
3     China
>>> countries_to_keep
['UK', 'China']
>>> df.country.isin(countries_to_keep)
0    False
1     True
2    False
3     True
Name: country, dtype: bool
>>> df[df.country.isin(countries_to_keep)]
    country
1        UK
3     China
>>> df[~df.country.isin(countries_to_keep)]
    country
0        US
2   Germany

Upvotes: 1538

rachwa

Reputation: 2310

You can also use .isin() inside .query():

df.query('country.isin(@countries_to_keep).values')

# Or alternatively:
df.query('country.isin(["UK", "China"]).values')

To negate your query, use ~:

df.query('~country.isin(@countries_to_keep).values')

Update:

Another way is to use comparison operators:

df.query('country == @countries_to_keep')

# Or alternatively:
df.query('country == ["UK", "China"]')

And to negate the query, use !=:

df.query('country != @countries_to_keep')

Upvotes: 7

padu

Reputation: 899

Why is no one talking about the performance of the various filtering methods? In fact, this topic often pops up here (see the example). I ran my own performance test on a large data set; the results are instructive.

df = pd.DataFrame({'animals': np.random.choice(['cat', 'dog', 'mouse', 'birds'], size=10**7), 
                   'number': np.random.randint(0,100, size=(10**7,))})

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 2 columns):
 #   Column   Dtype 
---  ------   ----- 
 0   animals  object
 1   number   int64 
dtypes: int64(1), object(1)
memory usage: 152.6+ MB

%%timeit
# .isin() on one column
conditions = ['cat', 'dog']
df[df.animals.isin(conditions)]

367 ms ± 2.34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
# .query() on one column
conditions = ['cat', 'dog']
df.query('animals in @conditions')

395 ms ± 3.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
# .loc[] with chained boolean comparisons
df.loc[(df.animals == 'cat') | (df.animals == 'dog')]

987 ms ± 5.17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
# row-wise .apply() with a Python `in` check
df[df.apply(lambda x: x['animals'] in ['cat', 'dog'], axis=1)]

41.9 s ± 490 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
# set_index() followed by .loc[] label lookup
new_df = df.set_index('animals')
new_df.loc[['cat', 'dog'], :]

3.64 s ± 62.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
# set_index() followed by index.isin()
new_df = df.set_index('animals')
new_df[new_df.index.isin(['cat', 'dog'])]

469 ms ± 8.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
# inner merge with a filtering Series
s = pd.Series(['cat', 'dog'], name='animals')
df.merge(s, on='animals', how='inner')

796 ms ± 30.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Thus, isin turned out to be the fastest method and the row-wise apply() the slowest, which is not surprising.
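
One further option that may be worth benchmarking (an untested assumption here, not measured above): casting the column to a categorical dtype, so that membership tests run on integer codes rather than Python strings.

df['animals'] = df['animals'].astype('category')
df[df.animals.isin(['cat', 'dog'])]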

Upvotes: 13

GenDemo

Reputation: 761

My 2c worth: I needed a combination of 'in' and if-else logic for a dataframe, and this worked for me.

sale_method = pd.DataFrame(model_data["Sale Method"].str.upper())
sale_method["sale_classification"] = np.where(
    sale_method["Sale Method"].isin(["PRIVATE"]),
    "private",
    np.where(
        sale_method["Sale Method"].str.contains("AUCTION"), "auction", "other"
    ),
)
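
If more classes get added later, nested np.where calls become hard to read. A sketch of a flatter alternative using np.select, assuming the same column names as above:

conditions = [
    sale_method["Sale Method"].isin(["PRIVATE"]),
    sale_method["Sale Method"].str.contains("AUCTION"),
]
sale_method["sale_classification"] = np.select(
    conditions, ["private", "auction"], default="other"
)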

Upvotes: 0

Billy Bonaros

Reputation: 1721

A trick if you want to keep the order of the list:

df = pd.DataFrame({'country': ['US', 'UK', 'Germany', 'China']})
countries_to_keep = ['Germany', 'US']


ind = [df.index[df['country'] == i].tolist() for i in countries_to_keep]
flat_ind = [item for sublist in ind for item in sublist]

df.reindex(flat_ind)

   country
2  Germany
0       US
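
A sketch of an alternative that gets the same ordering with an ordered categorical, assuming the same df and countries_to_keep as above:

out = df[df['country'].isin(countries_to_keep)].copy()
out['country'] = pd.Categorical(out['country'],
                                categories=countries_to_keep,
                                ordered=True)
out.sort_values('country')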

Upvotes: 3

cs95

Reputation: 402942

How to implement 'in' and 'not in' for a pandas DataFrame?

Pandas offers two methods: Series.isin and DataFrame.isin for Series and DataFrames, respectively.


Filter DataFrame Based on ONE Column (also applies to Series)

The most common scenario is applying an isin condition on a specific column to filter rows in a DataFrame.

df = pd.DataFrame({'countries': ['US', 'UK', 'Germany', np.nan, 'China']})
df
  countries
0        US
1        UK
2   Germany
3       NaN
4     China

c1 = ['UK', 'China']             # list
c2 = {'Germany'}                 # set
c3 = pd.Series(['China', 'US'])  # Series
c4 = np.array(['US', 'UK'])      # array

Series.isin accepts various types as inputs. The following are all valid ways of getting what you want:

df['countries'].isin(c1)

0    False
1     True
2    False
3    False
4     True
Name: countries, dtype: bool

# `in` operation
df[df['countries'].isin(c1)]

  countries
1        UK
4     China

# `not in` operation
df[~df['countries'].isin(c1)]

  countries
0        US
2   Germany
3       NaN

# Filter with `set` (tuples work too)
df[df['countries'].isin(c2)]

  countries
2   Germany

# Filter with another Series
df[df['countries'].isin(c3)]

  countries
0        US
4     China

# Filter with array
df[df['countries'].isin(c4)]

  countries
0        US
1        UK

Filter on MANY Columns

Sometimes, you will want to apply an 'in' membership check with some search terms over multiple columns,

df2 = pd.DataFrame({
    'A': ['x', 'y', 'z', 'q'], 'B': ['w', 'a', np.nan, 'x'], 'C': np.arange(4)})
df2

   A    B  C
0  x    w  0
1  y    a  1
2  z  NaN  2
3  q    x  3

c1 = ['x', 'w', 'p']

To apply the isin condition to both columns "A" and "B", use DataFrame.isin:

df2[['A', 'B']].isin(c1)

      A      B
0   True   True
1  False  False
2  False  False
3  False   True

From this, to retain rows where at least one column is True, we can use any along axis=1 (the columns):

df2[['A', 'B']].isin(c1).any(axis=1)

0     True
1    False
2    False
3     True
dtype: bool

df2[df2[['A', 'B']].isin(c1).any(axis=1)]

   A  B  C
0  x  w  0
3  q  x  3

Note that if you want to search every column, you'd just omit the column selection step and do

df2.isin(c1).any(axis=1)

Similarly, to retain rows where ALL columns are True, use all in the same manner as before.

df2[df2[['A', 'B']].isin(c1).all(axis=1)]

   A  B  C
0  x  w  0

Notable Mentions: numpy.isin, query, list comprehensions (string data)

In addition to the methods described above, you can also use the numpy equivalent: numpy.isin.

# `in` operation
df[np.isin(df['countries'], c1)]

  countries
1        UK
4     China

# `not in` operation
df[np.isin(df['countries'], c1, invert=True)]

  countries
0        US
2   Germany
3       NaN

Why is it worth considering? NumPy functions are usually a bit faster than their pandas equivalents because of lower overhead. Since this is an elementwise operation that does not depend on index alignment, there are very few situations where this method is not an appropriate replacement for pandas' isin.

Pandas routines are usually iterative when working with strings, because string operations are hard to vectorise. There is a lot of evidence to suggest that list comprehensions are faster here, so we resort to a plain Python in check.

c1_set = set(c1) # Using `in` with `sets` is a constant time operation... 
                 # This doesn't matter for pandas because the implementation differs.
# `in` operation
df[[x in c1_set for x in df['countries']]]

  countries
1        UK
4     China

# `not in` operation
df[[x not in c1_set for x in df['countries']]]

  countries
0        US
2   Germany
3       NaN

It is a lot more unwieldy to specify, however, so don't use it unless you know what you're doing.

Lastly, there's also DataFrame.query which has been covered in this answer. numexpr FTW!
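
For reference, the query() equivalents of the single-column operations above, using the same df and c1:

df.query("countries in @c1")       # `in`
df.query("countries not in @c1")   # `not in`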

Upvotes: 97

Abhishek Gaur

Reputation: 779

Collating possible solutions from the answers:

For IN: df[df['A'].isin([3, 6])]

For NOT IN:

  1. df[-df["A"].isin([3, 6])] (unary - as negation; prefer ~, since - on boolean data may be rejected by newer pandas/NumPy versions)

  2. df[~df["A"].isin([3, 6])]

  3. df[df["A"].isin([3, 6]) == False]

  4. df[np.logical_not(df["A"].isin([3, 6]))]

Upvotes: 18

Sam Henderson

Reputation: 489

I wanted to filter out dfbc rows whose BUSINESS_ID was also in the BUSINESS_ID column of dfProfilesBusIds:

dfbc = dfbc[~dfbc['BUSINESS_ID'].isin(dfProfilesBusIds['BUSINESS_ID'])]

Upvotes: 12

Ioannis Nasios

Reputation: 8527

df = pd.DataFrame({'countries':['US','UK','Germany','China']})
countries = ['UK','China']

To implement in:

df[df.countries.isin(countries)]

To implement not in, take in over the remaining countries:

df[df.countries.isin([x for x in np.unique(df.countries) if x not in countries])]

Upvotes: 5

Kos

Reputation: 72299

I usually do generic filtering over rows like this:

criterion = lambda row: row['countries'] not in countries
not_in = df[df.apply(criterion, axis=1)]
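
This is flexible, but note it is the row-wise apply() pattern that came in last in the benchmark above; the same criterion vectorised with isin:

not_in = df[~df['countries'].isin(countries)]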

Upvotes: 20
