Reputation: 3109
I would like to cleanly filter a dataframe using regex on one of the columns.
For a contrived example:
In [210]: foo = pd.DataFrame({'a' : [1,2,3,4], 'b' : ['hi', 'foo', 'fat', 'cat']})
In [211]: foo
Out[211]:
   a    b
0  1   hi
1  2  foo
2  3  fat
3  4  cat
I want to filter the rows to those that start with f using a regex. First go:
In [213]: foo.b.str.match('f.*')
Out[213]:
0    []
1    ()
2    ()
3    []
That's not too terribly useful. However this will get me my boolean index:
In [226]: foo.b.str.match('(f.*)').str.len() > 0
Out[226]:
0 False
1 True
2 True
3 False
Name: b
So I could then do my restriction by:
In [229]: foo[foo.b.str.match('(f.*)').str.len() > 0]
Out[229]:
   a    b
1  2  foo
2  3  fat
That makes me artificially put a group into the regex, though, and it seems like maybe not the cleanest way to go. Is there a better way to do this?
Upvotes: 272
Views: 337228
Reputation: 378
Here's a slightly different way. Calling columns with df.col_name may be confusing to future you; some people prefer df['col_name']. Here are two steps for filtering your dataframe as desired, which leave the original frame with all of its rows intact:
# Boolean mask: True where column b starts with 'f'
df['b'].str.startswith('f')
# Keep only the matching rows; df itself stays untouched
df_filt = df.loc[df['b'].str.startswith('f')]
Finally, you can handle NaN values however best fits your needs.
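For instance, one way (a sketch, assuming missing values should simply count as non-matches) is to use the na argument of startswith:

# NaN entries become False in the mask instead of NaN
df_filt = df.loc[df['b'].str.startswith('f', na=False)]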
Upvotes: 0
Reputation: 11460
Building off of the great answer by user3136169, here is an example of how that might be done while also removing NoneType values.
import re

def regex_filter(val):
    if val:
        # regex is defined elsewhere, e.g. regex = '^f'
        mo = re.search(regex, val)
        if mo:
            return True
        else:
            return False
    else:
        return False

df_filtered = df[df['col'].apply(regex_filter)]
You can also add regex as an arg:
def regex_filter(val, regex):
    ...

df_filtered = df[df['col'].apply(regex_filter, regex=myregex)]
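For example, with the question's frame (the pattern here is just an illustration):

myregex = '^f'  # match values starting with 'f'
df_filtered = foo[foo['b'].apply(regex_filter, regex=myregex)]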
Upvotes: 25
Reputation: 2290
You can use query in combination with contains:
foo.query('b.str.contains("^f").values')
Alternatively you can also use startswith:
foo.query('b.str.startswith("f").values')
However, I prefer the first alternative since it allows you to search for multiple patterns using the | operator, as shown below.
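For instance, to match rows whose b starts with either f or c (a sketch based on the question's frame):

foo.query('b.str.contains("^f|^c").values')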
Upvotes: 1
Reputation: 173
Using Python's built-in ability to write lambda expressions, we could filter by an arbitrary regex operation as follows:
import re
# with foo being our pd dataframe
foo[foo['b'].apply(lambda x: True if re.search('^f', x) else False)]
By using re.search you can filter by complex regex-style queries, which is more powerful in my opinion (as str.contains is rather limited).
Also important to mention: you want your string to start with a lowercase 'f'. With the regex f.* you would match an f at an arbitrary location within your text, while the ^ symbol explicitly states that the match must be at the beginning of your content. So using ^f would probably be a better idea :)
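To illustrate the difference with a small standalone sketch:

import re

re.search('f.*', 'painful')   # matches: the f sits in the middle of the string
re.search('^f', 'painful')    # None: the string does not start with f
re.search('^f', 'fat')        # matches at the beginning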
Upvotes: 7
Reputation: 393
It may be a bit late, but this is now easier to do in Pandas by calling Series.str.match. The docs explain the difference between match, fullmatch and contains.
Note that in order to use the results for indexing, set the na=False argument (or True if you want to include NaNs in the results).
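Applied to the question's frame, a sketch might look like:

foo[foo['b'].str.match('f', na=False)]  # keeps the 'foo' and 'fat' rows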
Upvotes: 29
Reputation: 2095
There is already a string handling function Series.str.startswith(). You should try foo[foo.b.str.startswith('f')].
Result:
   a    b
1  2  foo
2  3  fat

I think that is what you expect.
Alternatively, you can use contains with the regex option. For example:
foo[foo.b.str.contains('oo', regex= True, na=False)]
Result:
   a    b
1  2  foo
na=False is there to prevent errors in case there are NaN, null, etc. values.
Upvotes: 58
Reputation: 93754
Use contains instead:
In [10]: df.b.str.contains('^f')
Out[10]:
0    False
1     True
2     True
3    False
Name: b, dtype: bool
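To then restrict the frame itself, index with that mask, e.g.:

df[df.b.str.contains('^f')]  # keeps the 'foo' and 'fat' rows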
Upvotes: 290
Reputation: 323226
Using str slice:
foo[foo.b.str[0] == 'f']
Out[18]:
   a    b
1  2  foo
2  3  fat
Upvotes: 2
Reputation: 568
Write a Boolean function that checks the regex, and use apply on the column:
foo[foo['b'].apply(regex_function)]
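A minimal sketch of such a function for the question's pattern (the body here is an assumption; only the name regex_function comes from the answer):

import re

def regex_function(x):
    # True when the value starts with 'f'
    return bool(re.search('^f', x))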
Upvotes: 14
Reputation: 227
Searching across multiple columns of a dataframe:
# '.*' (not '*.') is the valid wildcard before MetaData, and backslashes and
# dots in the path must be escaped since match treats its pattern as a regex
frame[frame.filename.str.match('.*' + MetaData + '.*') & frame.file_path.str.match(r'C:\\test\\test\.txt')]
Upvotes: 21