Reputation: 9275
I have a pandas DataFrame with a column of string values. I need to select rows based on partial string matches.
Something like this idiom:
re.search(pattern, cell_in_question)
returning a boolean. I am familiar with the syntax of df[df['A'] == "hello world"],
but can't seem to find a way to do the same with a partial string match, say 'hello'.
Upvotes: 919
Views: 1559013
Reputation: 23351
The query API
As mentioned in other answers, you can use query to filter rows by making a call to str.contains inside the expression. A good thing about it is that, unlike boolean indexing, you won't get the pesky SettingWithCopyWarning after using it. You can also pass a pattern defined locally (or elsewhere) using @. Also useful kwargs:
- case=False: performs a case-insensitive search
- na=False: fills in False for missing values, e.g. NaN, NA, None etc.
df = pd.DataFrame({'col': ['foo', 'foobar', 'bar', 'baZ', pd.NA]})
pat = r'f|z'
df.query('col.str.contains(@pat, case=False, na=False)') # case-insensitive and return False if NaN
# or pass it as `local_dict`
df.query('col.str.contains(@pattern, case=False, na=False)', local_dict={'pattern': r'f|z'})
As shown above, you can handle NaN values in the column by passing na=False
. This is less error-prone (and faster) than converting the column to str
dtype or doing some other boolean checks as done in some answers on this page.
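To see why the astype(str) approach is error-prone, consider this toy example (my own illustration): converting to str turns missing values into the literal string 'nan', which can then spuriously match your pattern.
import numpy as np
import pandas as pd

s = pd.Series(['banana', np.nan])
s.astype(str).str.contains('nan')  # 0: True, 1: True -- NaN became the string 'nan' and matched
s.str.contains('nan', na=False)    # 0: True, 1: False -- the missing value is correctly excluded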
Since pandas string methods are not truly vectorized, it's often faster to drop down to vanilla Python and perform the task with an explicit loop. So if you want good performance, use a list comprehension rather than the vectorized str.contains. As you can see from the following benchmark (tested on Python 3.12.0 and pandas 2.1.1), str.contains, while terse, is about 20% slower than a list comprehension (even with the ternary operator for NaN handling). Since str.contains is a loop implementation anyway, this gap exists regardless of the size of the DataFrame.
import re
import pandas as pd
df = pd.DataFrame({'col': ['foo', 'foobar', 'bar', 'baZ', pd.NA]*100000})
pat = re.compile(r'f|z', flags=re.I)
%timeit df[df['col'].str.contains(pat, na=False)]
# 375 ms ± 15.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# `(x == x) is True` is False for NaN/pd.NA, so missing values yield False
%timeit df[[bool(pat.search(x)) if (x == x) is True else False for x in df['col'].tolist()]]
# 318 ms ± 14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Upvotes: 2
Reputation: 49886
Vectorized string methods (i.e. Series.str
) let you do the following:
df[df['A'].str.contains("hello")]
This is available in pandas 0.8.1 and up.
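A minimal sketch of the idea (the DataFrame here is my own illustration):
import pandas as pd

df = pd.DataFrame({'A': ['hello world', 'goodbye', 'say hello']})
df[df['A'].str.contains('hello')]
#              A
# 0  hello world
# 2    say hello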
Upvotes: 1433
Reputation: 402982
How do I select by partial string from a pandas DataFrame?
This post is meant for readers who want to
- search for a substring in a string column (the simplest case), e.g., df1[df1['col'].str.contains(r'foo(?!$)')]
- search for multiple substrings (similar to isin), e.g., with df4[df4['col'].str.contains(r'foo|baz')]
- match a whole word from text, e.g., with df3[df3['col'].str.contains(r'\bblue\b')]
- understand and fix "ValueError: cannot index with vector containing NA / NaN values" with str.contains('pattern', na=False)
...and would like to know more about what methods should be preferred over others.
(P.S.: I've seen a lot of questions on similar topics, so I thought it would be good to leave this here.)
Friendly disclaimer: this post is long.
# setup
df1 = pd.DataFrame({'col': ['foo', 'foobar', 'bar', 'baz']})
df1
col
0 foo
1 foobar
2 bar
3 baz
Basic substring search
str.contains can be used to perform either a substring search or a regex-based search. The search defaults to regex-based unless you explicitly disable it.
Here is an example of regex-based search,
# find rows in `df1` which contain "foo" followed by something
df1[df1['col'].str.contains(r'foo(?!$)')]
col
1 foobar
Sometimes regex search is not required, so specify regex=False
to disable it.
# select all rows containing "foo"
df1[df1['col'].str.contains('foo', regex=False)]
# same as df1[df1['col'].str.contains('foo')] but faster.
col
0 foo
1 foobar
Performance-wise, regex search is slower than substring search:
df2 = pd.concat([df1] * 1000, ignore_index=True)
%timeit df2[df2['col'].str.contains('foo')]
%timeit df2[df2['col'].str.contains('foo', regex=False)]
6.31 ms ± 126 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.8 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Avoid using regex-based search if you don't need it.
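Correctness is another reason to disable regex when you intend a literal match; a quick sketch with toy data of my own:
import pandas as pd

s = pd.Series(['a.b', 'acb', 'abc'])
s[s.str.contains('a.b')]               # regex: '.' matches any character -> 'a.b' and 'acb'
s[s.str.contains('a.b', regex=False)]  # literal match -> only 'a.b'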
Addressing ValueErrors
Sometimes, performing a substring search and filtering on the result will result in
ValueError: cannot index with vector containing NA / NaN values
This is usually because of mixed data or NaNs in your object column,
s = pd.Series(['foo', 'foobar', np.nan, 'bar', 'baz', 123])
s.str.contains('foo|bar')
0 True
1 True
2 NaN
3 True
4 False
5 NaN
dtype: object
s[s.str.contains('foo|bar')]
# ---------------------------------------------------------------------------
# ValueError Traceback (most recent call last)
Anything that is not a string cannot have string methods applied to it, so the result is NaN (naturally). In this case, specify na=False to ignore non-string data:
s.str.contains('foo|bar', na=False)
0 True
1 True
2 False
3 True
4 False
5 False
dtype: bool
How do I apply this to multiple columns at once?
The answer is in the question. Use DataFrame.apply
:
# apply the check to each column in turn (axis=0, the default, is column-wise)
df.apply(lambda col: col.str.contains('foo|bar', na=False), axis=0)
A B
0 True True
1 True False
2 False True
3 True False
4 False False
5 False False
All of the solutions below can be "applied" to multiple columns using the column-wise apply
method (which is OK in my book, as long as you don't have too many columns).
If you have a DataFrame with mixed columns and want to select only the object/string columns, take a look at select_dtypes
.
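As a minimal sketch of that idea (the mixed-dtype frame is my own example):
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar'], 'B': [1, 2]})
# run the substring check only on the object (string) columns
df.select_dtypes(include='object').apply(lambda col: col.str.contains('foo', na=False))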
Multiple substring search
This is most easily achieved through a regex search using the regex OR pipe.
# Slightly modified example.
df4 = pd.DataFrame({'col': ['foo abc', 'foobar xyz', 'bar32', 'baz 45']})
df4
col
0 foo abc
1 foobar xyz
2 bar32
3 baz 45
df4[df4['col'].str.contains(r'foo|baz')]
col
0 foo abc
1 foobar xyz
3 baz 45
You can also create a list of terms, then join them:
terms = ['foo', 'baz']
df4[df4['col'].str.contains('|'.join(terms))]
col
0 foo abc
1 foobar xyz
3 baz 45
Sometimes, it is wise to escape your terms in case they have characters that can be interpreted as regex metacharacters. If your terms contain any of the following characters...
. ^ $ * + ? { } [ ] \ | ( )
Then, you'll need to use re.escape
to escape them:
import re
df4[df4['col'].str.contains('|'.join(map(re.escape, terms)))]
col
0 foo abc
1 foobar xyz
3 baz 45
re.escape
has the effect of escaping the special characters so they're treated literally.
re.escape(r'.foo^')
# '\\.foo\\^'
Matching whole word(s)
By default, the substring search looks for the specified substring/pattern regardless of whether it is a full word or not. To only match full words, we will need to make use of regular expressions here; in particular, our pattern will need to specify word boundaries (\b).
For example,
df3 = pd.DataFrame({'col': ['the sky is blue', 'bluejay by the window']})
df3
col
0 the sky is blue
1 bluejay by the window
Now consider,
df3[df3['col'].str.contains('blue')]
col
0 the sky is blue
1 bluejay by the window
vs.
df3[df3['col'].str.contains(r'\bblue\b')]
col
0 the sky is blue
Matching multiple whole words
Similar to the above, except we add a word boundary (\b) to the joined pattern.
p = r'\b(?:{})\b'.format('|'.join(map(re.escape, terms)))
df4[df4['col'].str.contains(p)]
col
0 foo abc
3 baz 45
Where p
looks like this,
p
# '\\b(?:foo|baz)\\b'
Should you use list comprehensions?
Because you can! And you should! They are usually a little bit faster than string methods, because string methods are hard to vectorise and usually have loopy implementations.
Instead of,
df1[df1['col'].str.contains('foo', regex=False)]
Use the in
operator inside a list comp,
df1[['foo' in x for x in df1['col']]]
col
0 foo
1 foobar
Instead of,
regex_pattern = r'foo(?!$)'
df1[df1['col'].str.contains(regex_pattern)]
Use re.compile
(to cache your regex) + Pattern.search
inside a list comp,
p = re.compile(regex_pattern, flags=re.IGNORECASE)
df1[[bool(p.search(x)) for x in df1['col']]]
col
1 foobar
If "col" has NaNs, then instead of
df1[df1['col'].str.contains(regex_pattern, na=False)]
Use,
def try_search(p, x):
    try:
        return bool(p.search(x))
    except TypeError:
        return False

p = re.compile(regex_pattern)
df1[[try_search(p, x) for x in df1['col']]]
col
1 foobar
Other alternatives: np.char.find, np.vectorize, DataFrame.query
In addition to str.contains and list comprehensions, you can also use the following alternatives.
np.char.find
Supports substring searches (read: no regex) only.
df4[np.char.find(df4['col'].values.astype(str), 'foo') > -1]
col
0 foo abc
1 foobar xyz
np.vectorize
This is a wrapper around a loop, but with less overhead than most pandas str methods.
f = np.vectorize(lambda haystack, needle: needle in haystack)
f(df1['col'], 'foo')
# array([ True, True, False, False])
df1[f(df1['col'], 'foo')]
col
0 foo
1 foobar
Regex solutions are also possible:
regex_pattern = r'foo(?!$)'
p = re.compile(regex_pattern)
f = np.vectorize(lambda x: pd.notna(x) and bool(p.search(x)))
df1[f(df1['col'])]
col
1 foobar
DataFrame.query
Supports string methods through the python engine. This offers no visible performance benefits, but is nonetheless useful to know if you need to dynamically generate your queries.
df1.query('col.str.contains("foo")', engine='python')
col
0 foo
1 foobar
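For example, a hedged sketch of generating such a query dynamically (the variable names here are my own):
import pandas as pd

df1 = pd.DataFrame({'col': ['foo', 'foobar', 'bar', 'baz']})
col, pat = 'col', 'foo'
df1.query(f'{col}.str.contains(@pat)', engine='python')  # @pat is resolved from the local scope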
More information on the query and eval family of methods can be found at Dynamically evaluate an expression from a formula in Pandas.
In order of precedence, I would recommend:
- str.contains, for its simplicity and ease of handling NaNs and mixed data
- list comprehensions, for their performance (especially if your data is purely strings)
- np.vectorize
- (last) df.query
Upvotes: 321
Reputation: 2310
Somewhat similar to @cs95's answer, but here you don't need to specify an engine:
df.query('A.str.contains("hello").values')
Upvotes: 3
Reputation: 4636
I am using pandas 0.14.1 on macOS in an IPython notebook. I tried the proposed line above:
df[df["A"].str.contains("Hello|Britain")]
and got an error:
cannot index with vector containing NA / NaN values
but it worked perfectly when an "==True" condition was added, like this:
df[df['A'].str.contains("Hello|Britain")==True]
Upvotes: 395
Reputation: 761
My 2c worth:
I did the following:
sale_method = pd.DataFrame(model_data['Sale Method'].str.upper())
sale_method['sale_classification'] = \
np.where(sale_method['Sale Method'].isin(['PRIVATE']),
'private',
np.where(sale_method['Sale Method']
.str.contains('AUCTION'),
'auction',
'other'
)
)
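A runnable version with toy data, since model_data is not defined in the snippet (the data here is my own invention):
import numpy as np
import pandas as pd

model_data = pd.DataFrame({'Sale Method': ['private', 'Auction Hall', 'negotiated']})
sale_method = pd.DataFrame(model_data['Sale Method'].str.upper())
sale_method['sale_classification'] = np.where(
    sale_method['Sale Method'].isin(['PRIVATE']), 'private',
    np.where(sale_method['Sale Method'].str.contains('AUCTION'), 'auction', 'other'))
# sale_classification -> ['private', 'auction', 'other']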
Upvotes: 2
Reputation:
You can try treating the values as strings:
df[df['A'].astype(str).str.contains("Hello|Britain")]
Upvotes: 16
Reputation: 111
Suppose we have a column named "ENTITY" in the dataframe df. We can filter df to keep only the rows in which the "ENTITY" column does not contain "DM", by using a mask as follows:
mask = df['ENTITY'].str.contains('DM')
df = df.loc[~(mask)].copy(deep=True)
Upvotes: 9
Reputation: 5075
A more generalised example: looking for parts of a word OR specific words in a string.
df = pd.DataFrame([('cat andhat', 1000.0), ('hat', 2000000.0), ('the small dog', 1000.0), ('fog', 330000.0),('pet', 330000.0)], columns=['col1', 'col2'])
Specific parts of a sentence or word:
searchfor = '.*cat.*hat.*|.*the.*dog.*'
Create a column showing the affected rows (you can always filter them out as necessary):
df["TrueFalse"]=df['col1'].str.contains(searchfor, regex=True)
col1 col2 TrueFalse
0 cat andhat 1000.0 True
1 hat 2000000.0 False
2 the small dog 1000.0 True
3 fog 330000.0 False
4 pet 330000.0 False
Upvotes: 5
Reputation: 7421
Should you need to do a case-insensitive search for a string in a pandas dataframe column:
df[df['A'].str.contains("hello", case=False)]
Upvotes: 26
Reputation: 948
Maybe you want to search for some text in all columns of the Pandas dataframe, and not just in a subset of them. In this case, the following code will help.
df[df.apply(lambda row: row.astype(str).str.contains('String To Find').any(), axis=1)]
Warning. This method is relatively slow, albeit convenient.
Upvotes: 4
Reputation: 1564
Using contains didn't work well for my string with special characters. Find worked though.
df[df['A'].str.find("hello") != -1]
Upvotes: 5
Reputation: 760
There are earlier answers that accomplish the asked-for feature; anyway, I would like to show the most general way:
df.filter(regex=".*STRING_YOU_LOOK_FOR.*")
This way lets you get the columns you are looking for, however they are written.
(Obviously, you have to write the proper regex expression for each case.)
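Note that by default DataFrame.filter matches column labels, not cell values; a toy sketch:
import pandas as pd

df = pd.DataFrame(columns=['price_usd', 'price_eur', 'qty'])
df.filter(regex='price')  # keeps only the two price_* columns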
Upvotes: 2
Reputation: 20814
If anyone wonders how to solve a related problem, "Select column by partial string":
Use:
df.filter(like='hello') # select columns which contain the word hello
And to select rows by partial string matching, pass axis=0
to filter:
# selects rows which contain the word hello in their index label
df.filter(like='hello', axis=0)
Upvotes: 66
Reputation: 7203
Say you have the following DataFrame
:
>>> df = pd.DataFrame([['hello', 'hello world'], ['abcd', 'defg']], columns=['a','b'])
>>> df
a b
0 hello hello world
1 abcd defg
You can always use the in
operator in a lambda expression to create your filter.
>>> df.apply(lambda x: x['a'] in x['b'], axis=1)
0 True
1 False
dtype: bool
The trick here is to use the axis=1
option in the apply
to pass elements to the lambda function row by row, as opposed to column by column.
Upvotes: 23
Reputation: 299
Quick note: if you want to do selection based on a partial string contained in the index, try the following:
df['stridx']=df.index
df[df['stridx'].str.contains("Hello|Britain")]
Upvotes: 29
Reputation: 9275
Here's what I ended up doing for partial string matches. If anyone has a more efficient way of doing this please let me know.
import re
import pandas as pd

def stringSearchColumn_DataFrame(df, colName, regex):
    newdf = pd.DataFrame()
    for idx, record in df[colName].items():  # .iteritems() was removed in pandas 2.0
        if re.search(regex, record):
            newdf = pd.concat([df[df[colName] == record], newdf], ignore_index=True)
    return newdf
Upvotes: 6