Daina
Daina

Reputation: 431

Search and filter pandas dataframe with regular expressions

I'd appreciate your help. I have a pandas dataframe. I want to search 3 columns of the dataframe using a regular expression, then return all rows that meet the search criteria, sorted by one of my columns. I would like to write this as a function so I can implement this logic with other criteria if possible, but am not quite sure how to do this.

For example, I know how pull the results of a search thusly (with col1 being a column name):

idx1 = df.col1.str.contains(r'vhigh|high', flags=re.IGNORECASE, regex=True, na=False)

print df[~idx1]

but I can't figure out how to take this type of action, and perform it with multiple columns and then sort. Anyone have any tips?

Upvotes: 5

Views: 17622

Answers (1)

YS-L
YS-L

Reputation: 14738

You can use apply to make the code more concise. For example, given this DataFrame:

df = pd.DataFrame(
    {
        'col1': ['vhigh', 'low', 'vlow'],
        'col2': ['eee', 'low', 'high'],
        'val': [100,200,300]
    }
)
print df

Input:

    col1  col2  val
0  vhigh   eee  100
1    low   low  200
2   vlow  high  300

You can select all the rows that contain the strings vhigh or high in columns col1 or col2 as follow:

mask = df[['col1', 'col2']].apply(
    lambda x: x.str.contains(
        'vhigh|high',
        regex=True
    )
).any(axis=1)
print df[mask]

The apply function applies the contains function on each column (since by default axis=0). The any function returns a Boolean mask, with element True indicating that at least one of the columns met the search criteria. This can then be used to perform selection on the original DataFrame.

Output:

    col1  col2  val
0  vhigh   eee  100
2   vlow  high  300

Then, to sort the result by a column, e.g. the val column, you could simply do:

df[mask].sort('val')

Upvotes: 10

Related Questions