justinvf

Reputation: 3109

How to filter rows in pandas by regex

I would like to cleanly filter a dataframe using regex on one of the columns.

For a contrived example:

In [210]: foo = pd.DataFrame({'a' : [1,2,3,4], 'b' : ['hi', 'foo', 'fat', 'cat']})
In [211]: foo
Out[211]: 
   a    b
0  1   hi
1  2  foo
2  3  fat
3  4  cat

I want to filter the rows to those that start with f using a regex. First go:

In [213]: foo.b.str.match('f.*')
Out[213]: 
0    []
1    ()
2    ()
3    []

That's not too terribly useful. However this will get me my boolean index:

In [226]: foo.b.str.match('(f.*)').str.len() > 0
Out[226]: 
0    False
1     True
2     True
3    False
Name: b

So I could then do my restriction by:

In [229]: foo[foo.b.str.match('(f.*)').str.len() > 0]
Out[229]: 
   a    b
1  2  foo
2  3  fat

That makes me artificially put a group into the regex, though, and it seems like maybe not the cleanest way to go. Is there a better way to do this?

Upvotes: 272

Views: 337228

Answers (10)

3nrique0

Reputation: 378

Here's a slightly different way.

Calling columns with df.col_name may be confusing to future you; some people prefer df['col_name']. Here are two steps for filtering your dataframe as desired; the original dataframe is left intact.

  1. Make a boolean Pandas Series:
df['b'].str.startswith('f')
  2. Use that boolean Series to filter your dataframe into a new dataframe:
df_filt = df.loc[df['b'].str.startswith('f')]

Finally, you can handle NaN values as best fits your needs; a quick sketch follows the pointers below.

  • The pandas documentation on missing data is worth checking (in particular the propagation in arithmetic and comparison)
  • Or just check whether some missing data slipped through
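
For instance, a minimal sketch of the NaN handling, assuming rows with missing values should simply be dropped by the filter (str.startswith accepts an na argument, and .fillna on the mask is an equivalent alternative):

df_filt = df.loc[df['b'].str.startswith('f', na=False)]
# or, making the mask explicit first:
mask = df['b'].str.startswith('f').fillna(False)
df_filt = df.loc[mask]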

Upvotes: 0

sparrow

Reputation: 11460

Building off of the great answer by user3136169, here is an example of how that might be done, also filtering out None values.

import re

def regex_filter(val):
    # Skip None/NaN values before running the regex
    if val:
        mo = re.search(regex, val)  # regex is assumed to be defined in the enclosing scope
        return bool(mo)
    return False

df_filtered = df[df['col'].apply(regex_filter)]

You can also add regex as an arg:

def regex_filter(val, regex):
    ...

df_filtered = df[df['col'].apply(regex_filter, regex=myregex)]

Upvotes: 25

rachwa

Reputation: 2290

You can use query in combination with contains:

foo.query('b.str.contains("^f").values')

Alternatively, you can use startswith:

foo.query('b.str.startswith("f").values')

However, I prefer the first alternative, since it allows you to search for multiple patterns using the | operator.
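
For example, a hypothetical two-pattern query along the same lines, keeping rows whose b starts with either 'f' or 'c':

foo.query('b.str.contains("^f|^c").values')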

Upvotes: 1

Martin Bucher

Reputation: 173

Using Python's built-in ability to write lambda expressions, we could filter by an arbitrary regex operation as follows:

import re  

# with foo being our pd dataframe
foo[foo['b'].apply(lambda x: bool(re.search('^f', x)))]

Using re.search lets you filter with complex regex-style queries, which in my opinion is more powerful (str.contains is more limited).

Also important to mention: you want your string to start with a lowercase 'f'. With re.search, the regex f.* matches an 'f' at an arbitrary location within the text, while the ^ symbol explicitly anchors the match to the beginning of the string. So using ^f is probably the better idea :)
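
A quick illustration of the difference, on hypothetical strings:

import re

re.search('f.*', 'safari')  # matches 'fari', even though 'safari' does not start with 'f'
re.search('^f', 'safari')   # returns None: no 'f' at the beginning
re.search('^f', 'fat')      # matches at position 0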

Upvotes: 7

Michael Siler

Reputation: 393

It may be a bit late, but this is now easier to do in Pandas by calling Series.str.match. The docs explain the difference between match, fullmatch and contains.

Note that in order to use the result for indexing, set the na=False argument (or True if you want to include NaNs in the result).
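
On the question's dataframe, that would presumably look like this (assuming a reasonably recent pandas version, where match returns a boolean Series; match anchors at the start, so the .* is not needed):

foo[foo.b.str.match('f', na=False)]

   a    b
1  2  foo
2  3  fat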

Upvotes: 29

Erkan Şirin

Reputation: 2095

There is already a string handling function Series.str.startswith(). You should try foo[foo.b.str.startswith('f')].

Result:

    a   b
1   2   foo
2   3   fat

I think this is what you expect.

Alternatively, you can use contains with the regex option. For example:

foo[foo.b.str.contains('oo', regex=True, na=False)]

Result:

    a   b
1   2   foo

na=False is there to prevent errors in case there are NaN, null, etc. values.

Upvotes: 58

waitingkuo

Reputation: 93754

Use contains instead:

In [10]: df.b.str.contains('^f')
Out[10]: 
0    False
1     True
2     True
3    False
Name: b, dtype: bool
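
The boolean mask then drops straight into the row selection from the question:

In [11]: df[df.b.str.contains('^f')]
Out[11]: 
   a    b
1  2  foo
2  3  fat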

Upvotes: 290

BENY

Reputation: 323226

Using str slicing to check the first character:

foo[foo.b.str[0]=='f']
Out[18]: 
   a    b
1  2  foo
2  3  fat

Upvotes: 2

tzviya

Reputation: 568

Write a Boolean function that checks the regex, and use apply on the column:

foo[foo['b'].apply(regex_function)]
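
A minimal sketch of what such a function might look like (regex_function is hypothetical here, checking for strings that start with 'f'):

import re

def regex_function(val):
    # True when the value matches the pattern, False otherwise
    return bool(re.search('^f', val))

foo[foo['b'].apply(regex_function)]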

Upvotes: 14

lakshman senathirajah

Reputation: 227

Searching across multiple columns of a dataframe:

frame[frame.filename.str.match('.*' + MetaData + '.*') & frame.file_path.str.match(r'C:\\test\\test\.txt')]

Note the raw string with doubled backslashes for the Windows path: str.match treats the pattern as a regex, so literal backslashes (and the dot before the extension) need escaping.

Upvotes: 21
