Pandas: how to filter for multiple substrings in a Series

Question

I would like to check if pandas dataframe column id contains the following substrings '.F1', '.N1', '.FW', '.SP'.

I am currently using the following code:

searchfor = ['.F1', '.N1', '.FW', '.SP']
mask = (df["id"].str.contains('|'.join(searchfor)))

The id column looks like this:

                   ID
0  F611B4E369F1D293B5
1  10302389527F190F1A

I am actually looking to see whether the id column contains any of the four substrings, each starting with a literal dot (.). For some reason, ids that contain F1 without a preceding dot are also matched. In the current example, neither id contains .F1. I would really appreciate it if someone could let me know how to solve this issue. Thank you so much.

SeaBean · Accepted Answer

In a regex, an unescaped . matches any character, which is why an id containing 9F1 matches '.F1'. You can use re.escape() to escape the regex metacharacters. Mapping it over the list means you don't need to escape every string in searchfor by hand (no need to change the definition of searchfor):

import re

searchfor = ['.F1', '.N1', '.FW', '.SP']            # no need to escape each string

pattern = '|'.join(map(re.escape, searchfor))       # use re.escape() with map()

mask = (df["id"].str.contains(pattern))

re.escape() will escape each string for you:

print(pattern)

\.F1|\.N1|\.FW|\.SP
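
To see the difference side by side, here is a minimal sketch. The first two ids are taken from the question; the last two (ABC.F1 and XYZ.SP99) are made-up values added only so that something contains a literal dot, and the names unescaped_mask and escaped_mask are just for illustration.

```python
import re
import pandas as pd

# First two ids come from the question; the last two are hypothetical
# values that contain a literal dot.
df = pd.DataFrame(
    {"id": ["F611B4E369F1D293B5", "10302389527F190F1A", "ABC.F1", "XYZ.SP99"]}
)

searchfor = ['.F1', '.N1', '.FW', '.SP']

# Unescaped: '.' matches any character, so '9F1' and '7F1' satisfy '.F1'.
unescaped_mask = df["id"].str.contains('|'.join(searchfor))

# Escaped: '\.' matches only a literal dot.
escaped_mask = df["id"].str.contains('|'.join(map(re.escape, searchfor)))

print(unescaped_mask.tolist())  # [True, True, True, True]
print(escaped_mask.tolist())    # [False, False, True, True]
```

df[escaped_mask] then keeps only the rows whose id really contains one of the dotted substrings. If the column may contain missing values, passing na=False to str.contains() should keep the mask purely boolean.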
