euforia

Reputation: 9275

Filter pandas DataFrame by substring criteria

I have a pandas DataFrame with a column of string values. I need to select rows based on partial string matches.

Something like this idiom:

re.search(pattern, cell_in_question) 

returning a boolean. I am familiar with the syntax of df[df['A'] == "hello world"] but can't seem to find a way to do the same with a partial string match, say 'hello'.

Upvotes: 919

Views: 1559013

Answers (18)

cottontail

Reputation: 23351

query API

As mentioned in other answers, you can use query to filter rows by calling str.contains inside the expression. A nice property is that, unlike boolean indexing, it won't trigger the pesky SettingWithCopyWarning. You can also pass a pattern defined locally (or elsewhere) using @. Other useful kwargs:

  • case=False: perform a case-insensitive search
  • na=False: fill in False for missing values, e.g. NaN, NA, None etc.
df = pd.DataFrame({'col': ['foo', 'foobar', 'bar', 'baZ', pd.NA]})
pat = r'f|z'
df.query('col.str.contains(@pat, case=False, na=False)')    # case-insensitive and return False if NaN

# or pass it as `local_dict`
df.query('col.str.contains(@pattern, case=False, na=False)', local_dict={'pattern': r'f|z'})

As shown above, you can handle NaN values in the column by passing na=False. This is less error-prone (and faster) than converting the column to str dtype or doing some other boolean checks as done in some answers on this page.

Performance

Since pandas string methods are not truly vectorized (they loop in Python under the hood), it's often faster to drop down to vanilla Python and do the work with an explicit loop. So if you want good performance, use a list comprehension rather than str.contains. As the following benchmark shows (tested on Python 3.12.0 and pandas 2.1.1), str.contains, while terse, is about 20% slower than a list comprehension (even with the ternary expression for NaN handling). Because str.contains loops under the hood as well, this gap persists regardless of the DataFrame's size.

import re
import pandas as pd
df = pd.DataFrame({'col': ['foo', 'foobar', 'bar', 'baZ', pd.NA]*100000})
pat = re.compile(r'f|z', flags=re.I)

%timeit df[df['col'].str.contains(pat, na=False)]
# 375 ms ± 15.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df[[bool(pat.search(x)) if (x == x) is True else False for x in df['col'].tolist()]]  # (x == x) is True only for non-missing values
# 318 ms ± 14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Upvotes: 2

Garrett

Reputation: 49886

Vectorized string methods (i.e. Series.str) let you do the following:

df[df['A'].str.contains("hello")]

This is available in pandas 0.8.1 and up.
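
For a self-contained illustration (the DataFrame here is made up):

import pandas as pd

df = pd.DataFrame({'A': ['hello world', 'goodbye', 'say hello']})
df[df['A'].str.contains("hello")]

             A
0  hello world
2    say hello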

Upvotes: 1433

usman Abbasi

Reputation: 107

df[df['A'].str.contains("hello", case=False)]

Upvotes: 3

cs95

Reputation: 402982

How do I select by partial string from a pandas DataFrame?

This post is meant for readers who want to

  • search for a substring in a string column (the simplest case) as in df1[df1['col'].str.contains(r'foo(?!$)')]
  • search for multiple substrings (similar to isin), e.g., with df4[df4['col'].str.contains(r'foo|baz')]
  • match a whole word from text (e.g., "blue" should match "the sky is blue" but not "bluejay"), e.g., with df3[df3['col'].str.contains(r'\bblue\b')]
  • match multiple whole words
  • understand the reason behind "ValueError: cannot index with vector containing NA / NaN values" and correct it with str.contains('pattern', na=False)

...and would like to know more about what methods should be preferred over others.

(P.S.: I've seen a lot of questions on similar topics, I thought it would be good to leave this here.)

Friendly disclaimer: this post is long.


Basic Substring Search

# setup
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'col': ['foo', 'foobar', 'bar', 'baz']})
df1

      col
0     foo
1  foobar
2     bar
3     baz

str.contains can be used to perform either substring searches or regex based search. The search defaults to regex-based unless you explicitly disable it.

Here is an example of regex-based search,

# find rows in `df1` which contain "foo" followed by something
df1[df1['col'].str.contains(r'foo(?!$)')]

      col
1  foobar

Sometimes regex search is not required, so specify regex=False to disable it.

# select all rows containing "foo"
df1[df1['col'].str.contains('foo', regex=False)]
# same as df1[df1['col'].str.contains('foo')] but faster.
   
      col
0     foo
1  foobar

Performance wise, regex search is slower than substring search:

df2 = pd.concat([df1] * 1000, ignore_index=True)

%timeit df2[df2['col'].str.contains('foo')]
%timeit df2[df2['col'].str.contains('foo', regex=False)]

6.31 ms ± 126 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.8 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Avoid using regex-based search if you don't need it.

Addressing ValueErrors
Sometimes, performing a substring search and filtering on the result will raise

ValueError: cannot index with vector containing NA / NaN values

This is usually because of mixed data or NaNs in your object column,

s = pd.Series(['foo', 'foobar', np.nan, 'bar', 'baz', 123])
s.str.contains('foo|bar')

0     True
1     True
2      NaN
3     True
4    False
5      NaN
dtype: object


s[s.str.contains('foo|bar')]
# ---------------------------------------------------------------------------
# ValueError                                Traceback (most recent call last)

Anything that is not a string cannot have string methods applied on it, so the result is NaN (naturally). In this case, specify na=False to ignore non-string data,

s.str.contains('foo|bar', na=False)

0     True
1     True
2    False
3     True
4    False
5    False
dtype: bool

How do I apply this to multiple columns at once?
Use DataFrame.apply:

# apply the function to each column (apply's default, axis=0)
df.apply(lambda col: col.str.contains('foo|bar', na=False))

       A      B
0   True   True
1   True  False
2  False   True
3   True  False
4  False  False
5  False  False

All of the solutions below can be "applied" to multiple columns using the column-wise apply method (which is OK in my book, as long as you don't have too many columns).

If you have a DataFrame with mixed columns and want to select only the object/string columns, take a look at select_dtypes.
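
For instance, a minimal sketch of that idea (the frame and pattern here are made up): restrict the search to the string columns, then keep rows where any of them match.

str_df = pd.DataFrame({'A': ['foo', 'bar'], 'B': ['baz', 'quux'], 'n': [1, 2]})
str_cols = str_df.select_dtypes(include='object').columns
str_df[str_df[str_cols].apply(lambda col: col.str.contains('foo|baz', na=False)).any(axis=1)]

     A    B  n
0  foo  baz  1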


Multiple Substring Search

This is most easily achieved through a regex search using the regex OR pipe.

# Slightly modified example.
df4 = pd.DataFrame({'col': ['foo abc', 'foobar xyz', 'bar32', 'baz 45']})
df4

          col
0     foo abc
1  foobar xyz
2       bar32
3      baz 45

df4[df4['col'].str.contains(r'foo|baz')]

          col
0     foo abc
1  foobar xyz
3      baz 45

You can also create a list of terms, then join them:

terms = ['foo', 'baz']
df4[df4['col'].str.contains('|'.join(terms))]

          col
0     foo abc
1  foobar xyz
3      baz 45

Sometimes, it is wise to escape your terms in case they have characters that can be interpreted as regex metacharacters. If your terms contain any of the following characters...

. ^ $ * + ? { } [ ] \ | ( )

Then, you'll need to use re.escape to escape them:

import re
df4[df4['col'].str.contains('|'.join(map(re.escape, terms)))]

          col
0     foo abc
1  foobar xyz
3      baz 45

re.escape has the effect of escaping the special characters so they're treated literally.

re.escape(r'.foo^')
# '\\.foo\\^'

Matching Entire Word(s)

By default, the substring search searches for the specified substring/pattern regardless of whether it is full word or not. To only match full words, we will need to make use of regular expressions here—in particular, our pattern will need to specify word boundaries (\b).

For example,

df3 = pd.DataFrame({'col': ['the sky is blue', 'bluejay by the window']})
df3

                     col
0        the sky is blue
1  bluejay by the window
 

Now consider,

df3[df3['col'].str.contains('blue')]

                     col
0        the sky is blue
1  bluejay by the window

vs.

df3[df3['col'].str.contains(r'\bblue\b')]

               col
0  the sky is blue

Multiple Whole Word Search

Similar to the above, except we add a word boundary (\b) to the joined pattern.

p = r'\b(?:{})\b'.format('|'.join(map(re.escape, terms)))
df4[df4['col'].str.contains(p)]

       col
0  foo abc
3   baz 45

Where p looks like this,

p
# '\\b(?:foo|baz)\\b'

A Great Alternative: Use List Comprehensions!

Because you can! And you should! They are usually a little bit faster than string methods, because string methods are hard to vectorise and usually have loopy implementations.

Instead of,

df1[df1['col'].str.contains('foo', regex=False)]

Use the in operator inside a list comp,

df1[['foo' in x for x in df1['col']]]

      col
0     foo
1  foobar

Instead of,

regex_pattern = r'foo(?!$)'
df1[df1['col'].str.contains(regex_pattern)]

Use re.compile (to cache your regex) + Pattern.search inside a list comp,

p = re.compile(regex_pattern, flags=re.IGNORECASE)
df1[[bool(p.search(x)) for x in df1['col']]]

      col
1  foobar

If "col" has NaNs, then instead of

df1[df1['col'].str.contains(regex_pattern, na=False)]

Use,

def try_search(p, x):
    try:
        return bool(p.search(x))
    except TypeError:
        return False

p = re.compile(regex_pattern)
df1[[try_search(p, x) for x in df1['col']]]

      col
1  foobar
 

More Options for Partial String Matching: np.char.find, np.vectorize, DataFrame.query.

In addition to str.contains and list comprehensions, you can also use the following alternatives.

np.char.find
Supports substring searches (read: no regex) only.

df4[np.char.find(df4['col'].values.astype(str), 'foo') > -1]

          col
0     foo abc
1  foobar xyz

np.vectorize
This is a wrapper around a loop, but with less overhead than most pandas str methods.

f = np.vectorize(lambda haystack, needle: needle in haystack)
f(df1['col'], 'foo')
# array([ True,  True, False, False])

df1[f(df1['col'], 'foo')]

      col
0     foo
1  foobar

Regex solutions possible:

regex_pattern = r'foo(?!$)'
p = re.compile(regex_pattern)
f = np.vectorize(lambda x: pd.notna(x) and bool(p.search(x)))
df1[f(df1['col'])]

      col
1  foobar

DataFrame.query
Supports string methods through the python engine. This offers no visible performance benefits, but is nonetheless useful to know if you need to dynamically generate your queries.

df1.query('col.str.contains("foo")', engine='python')

      col
0     foo
1  foobar
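
For example, a minimal sketch of dynamically building the expression (the column name and pattern here are made up):

col_name, pat = 'col', 'foo'
df1.query(f'{col_name}.str.contains(@pat)', engine='python')

      col
0     foo
1  foobar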

More information on query and eval family of methods can be found at Dynamically evaluate an expression from a formula in Pandas.


Recommended Usage Precedence

  1. (First) str.contains, for its simplicity and ease of handling NaNs and mixed data
  2. List comprehensions, for their performance (especially if your data is purely strings)
  3. np.vectorize
  4. (Last) df.query

Upvotes: 321

rachwa

Reputation: 2310

Somewhat similar to @cs95's answer, but here you don't need to specify an engine:

df.query('A.str.contains("hello").values')

Upvotes: 3

sharon

Reputation: 4636

I am using pandas 0.14.1 on macOS in the IPython Notebook. I tried the proposed line above:

df[df["A"].str.contains("Hello|Britain")]

and got an error:

cannot index with vector containing NA / NaN values

but it worked perfectly when an "==True" condition was added, like this:

df[df['A'].str.contains("Hello|Britain")==True]

(This works because comparing with == True coerces the NaN entries in the mask to False.)

Upvotes: 395

GenDemo

Reputation: 761

My 2c worth:

I did the following:

import numpy as np
import pandas as pd

# classify each sale as 'private', 'auction', or 'other'
sale_method = pd.DataFrame(model_data['Sale Method'].str.upper())
sale_method['sale_classification'] = np.where(
    sale_method['Sale Method'].isin(['PRIVATE']),
    'private',
    np.where(sale_method['Sale Method'].str.contains('AUCTION'),
             'auction',
             'other'))
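
A possibly tidier way to express the same nested classification is np.select, which takes a list of conditions and matching choices (same assumed columns as above):

conditions = [
    sale_method['Sale Method'].isin(['PRIVATE']),
    sale_method['Sale Method'].str.contains('AUCTION'),
]
sale_method['sale_classification'] = np.select(conditions, ['private', 'auction'], default='other')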

Upvotes: 2

user2110417

Reputation:

You can cast the values to string first:

df[df['A'].astype(str).str.contains("Hello|Britain")]
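
One caveat worth knowing: astype(str) turns missing values into the literal string 'nan', so previously-missing rows can start matching patterns. A small demonstration:

import numpy as np

s = pd.Series(['Hello', np.nan])
s.astype(str)
# 0    Hello
# 1      nan    <- the missing value is now the string 'nan'
# dtype: object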

Upvotes: 16

Angeline Kingsteena

Reputation: 111

Suppose we have a column named "ENTITY" in the DataFrame df. We can filter df to keep the rows whose "ENTITY" column does not contain "DM", using a mask as follows:

mask = df['ENTITY'].str.contains('DM')

df = df.loc[~(mask)].copy(deep=True)
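
If 'ENTITY' can contain missing values, str.contains will return NaN for them and the indexing will fail with the ValueError discussed above; passing na= keeps the mask boolean:

mask = df['ENTITY'].str.contains('DM', na=False)  # missing rows count as "no match" and are kept
df = df.loc[~(mask)].copy(deep=True)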

Upvotes: 9

Grant Shannon

Reputation: 5075

A more generalised example - if looking for parts of a word OR specific words in a string:

df = pd.DataFrame([('cat andhat', 1000.0), ('hat', 2000000.0), ('the small dog', 1000.0), ('fog', 330000.0),('pet', 330000.0)], columns=['col1', 'col2'])

Specific parts of a sentence or word:

searchfor = '.*cat.*hat.*|.*the.*dog.*'

Create a column showing the affected rows (you can always filter them out as necessary):

df["TrueFalse"]=df['col1'].str.contains(searchfor, regex=True)

            col1       col2  TrueFalse
0     cat andhat     1000.0       True
1            hat  2000000.0      False
2  the small dog     1000.0       True
3            fog   330000.0      False
4            pet   330000.0      False
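
To then keep only the matching rows, index with the new column:

df[df['TrueFalse']]

            col1    col2  TrueFalse
0     cat andhat  1000.0       True
2  the small dog  1000.0       True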

Upvotes: 5

cardamom

Reputation: 7421

Should you need to do a case-insensitive search for a string in a pandas DataFrame column:

df[df['A'].str.contains("hello", case=False)]

Upvotes: 26

Serhii Kushchenko

Reputation: 948

Maybe you want to search for some text in all columns of the pandas DataFrame, and not just in a subset of them. In that case, the following code will help.

df[df.apply(lambda row: row.astype(str).str.contains('String To Find').any(), axis=1)]

Warning: this method is relatively slow, albeit convenient.

Upvotes: 4

Katu

Reputation: 1564

Using contains didn't work well for my string with special characters. Find worked though.

df[df['A'].str.find("hello") != -1]
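
The likely reason contains misbehaved: it treats the pattern as a regular expression by default, so regex metacharacters in the search string change its meaning. Passing regex=False gives the same plain substring behaviour as find (the pattern here is made up):

df[df['A'].str.contains("hello (world)", regex=False)]  # '(' and ')' are matched literally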

Upvotes: 5

xpeiro

Reputation: 760

Earlier answers already accomplish this, but I would like to show the most general way:

df.filter(regex=".*STRING_YOU_LOOK_FOR.*")

This lets you select the column you are looking for, however its name is written.

(Obviously, you have to write the proper regex for each case.)

Upvotes: 2

Philipp Schwarz

Reputation: 20814

If anyone wonders how to solve a related problem: "Select column by partial string"

Use:

df.filter(like='hello')  # select columns which contain the word hello

And to select rows by partial string matching, pass axis=0 to filter:

# selects rows which contain the word hello in their index label
df.filter(like='hello', axis=0)  

Upvotes: 66

Mike

Reputation: 7203

Say you have the following DataFrame:

>>> df = pd.DataFrame([['hello', 'hello world'], ['abcd', 'defg']], columns=['a','b'])
>>> df
       a            b
0  hello  hello world
1   abcd         defg

You can always use the in operator in a lambda expression to create your filter.

>>> df.apply(lambda x: x['a'] in x['b'], axis=1)
0     True
1    False
dtype: bool

The trick here is to use the axis=1 option in the apply to pass elements to the lambda function row by row, as opposed to column by column.
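
To filter the rows with that boolean Series:

>>> df[df.apply(lambda x: x['a'] in x['b'], axis=1)]
       a            b
0  hello  hello world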

Upvotes: 23

Christian

Reputation: 299

Quick note: if you want to do selection based on a partial string contained in the index, try the following:

df['stridx']=df.index
df[df['stridx'].str.contains("Hello|Britain")]
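
The index also exposes the .str accessor directly, so the helper column is not strictly necessary; a one-line alternative:

df[df.index.str.contains("Hello|Britain")]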

Upvotes: 29

euforia

Reputation: 9275

Here's what I ended up doing for partial string matches. If anyone has a more efficient way of doing this please let me know.

import re
import pandas as pd

def stringSearchColumn_DataFrame(df, colName, regex):
    newdf = pd.DataFrame()
    for idx, record in df[colName].items():  # iteritems() was removed in pandas 2.0
        if re.search(regex, record):
            newdf = pd.concat([df[df[colName] == record], newdf], ignore_index=True)
    return newdf
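
On modern pandas, the whole function above is essentially equivalent to a single line of boolean indexing:

df[df[colName].str.contains(regex, na=False)]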

Upvotes: 6
