Reputation: 121
I would like to check, row by row in a pandas DataFrame, whether a string value contains any of several other strings, defined as a regex. The regex to check for changes per row and is currently stored in a Series, in the format shown below:
df = pd.DataFrame(["a", "a", "b", "c", "de", "de"], columns=["Value"])
df:
| Index | Value |
| 0 | "a" |
| 1 | "a" |
| 2 | "b" |
| 3 | "c" |
| 4 | "de" |
| 5 | "de" |
series = pd.Series(["a|b|c", "a", "d|e", "c", "c|a", "f|e"])
Series with the regex to check per row:
| Index | Value |
| 0 | "a|b|c" |
| 1 | "a" |
| 2 | "d|e" |
| 3 | "c" |
| 4 | "c|a" |
| 5 | "f|e" |
Expected output of contains:
[True, True, False, True, False, True]
If I were doing an .isin() and only needed to check for an exact match of some values, I could simply do the following:
Series with isin list per row:
| Index | Value |
| 0 | [a, b] |
| 1 | [a] |
| 2 | [d, e] |
| 3 | [c] |
| 4 | [c, a] |
| 5 | [d, e] |
df["Value"].isin(series)
since, as I understand it, each row would be mapped by index to the right list to check.
However, I need to check whether any of those values are contained in the string. They do not need to be an exact match, so I need to use contains. I keep getting a "Series is not hashable" error when trying to do this:
df["Value"].str.lower().str.contains(series)
I am not sure how to make the contains function use the right regex for each row. I would like to avoid lambdas and apply as much as possible, since I am processing a big dataset and need execution to be as fast as possible.
Thanks for the help.
Upvotes: 1
Views: 170
Reputation: 261914
str.contains takes a single pattern for the whole Series, not one pattern per row, so you need to use a loop/list comprehension here:
import re
out = [bool(re.search(pat, s)) for pat, s in zip(series, df['Value'])]
# or as new column in df:
# df['new'] = [bool(re.search(pat, s)) for pat, s in zip(series, df['Value'])]
Output: [True, True, False, True, False, True]
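For reference, here is a self-contained sketch that rebuilds the question's data and runs the comprehension above. It also shows one possible variant, which is an assumption and not part of the original answer: if the same pattern repeats across many rows, the rows could be grouped by pattern so that str.contains runs once per distinct pattern. The names `result`, `pat` and `idx` are illustrative, and the variant assumes `series` and `df` share the same index.

```python
import re
import pandas as pd

# Data from the question
df = pd.DataFrame(["a", "a", "b", "c", "de", "de"], columns=["Value"])
series = pd.Series(["a|b|c", "a", "d|e", "c", "c|a", "f|e"])

# Row-wise search: pair each pattern with the value in the same position
out = [bool(re.search(pat, s)) for pat, s in zip(series, df["Value"])]

# Hypothetical variant: one vectorized str.contains call per distinct pattern.
# Assumes series and df have identical index labels.
result = pd.Series(False, index=df.index)
for pat, idx in series.groupby(series).groups.items():
    # idx holds the index labels of all rows that use this pattern
    result.loc[idx] = df.loc[idx, "Value"].str.contains(pat, regex=True).to_numpy()
```

Whether the grouped variant is faster depends on how many distinct patterns there are. With one pattern per row, as in this small example, it does the same work as the comprehension plus grouping overhead, so it is worth timing on the real data.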
Upvotes: 1