Raquel

Reputation: 121

Check contains condition conditionally by dataframe index in Pandas dataframe

I would like to check, row by row, whether a string value in a pandas DataFrame contains any of several other strings, expressed as a regex. The regex to check changes per row and is currently stored in a Series, in the format shown below:

df = pd.DataFrame(["a", "a", "b", "c", "de", "de"], columns=["Value"])

df:
| Index   | Value   |
|   0     | "a"     |
|   1     | "a"     |
|   2     | "b"     |
|   3     | "c"     |
|   4     | "de"    |
|   5     | "de"    |

series = pd.Series(["a|b|c", "a", "d|e", "c", "c|a", "f|e"])

Series with contains regex per row:
| Index   | Value   |
|   0     | "a|b|c" |
|   1     | "a"     |
|   2     | "d|e"   |
|   3     | "c"     |
|   4     | "c|a"   |
|   5     | "f|e"   |

Expected output of contains:
[True, True, False, True, False, True]

If I were doing an .isin() and only needed to check for an exact match of some values, I could simply do the following:

Series with isin list per row:
| Index   | Value   |
|   0     | [a, b]  |
|   1     | [a]     |
|   2     | [d, e]  |
|   3     | [c]     |
|   4     | [c, a]  |
|   5     | [d, e]  |

df["Value"].isin(series)

since each row would be mapped to the right list to check by index by default.

However, I need to check whether any of those values are contained in the string. They do not need to be an exact match, so I need to use contains. I keep getting a "Series is not hashable" error when trying to do this:

df["Value"].str.lower().str.contains(series)

I am not sure how I could make the contains function use the right regex for each row. I would like to avoid lambdas and apply as much as possible, since I am processing a big dataset and need execution to be as fast as possible.

Thanks for the help.

Upvotes: 1

Views: 170

Answers (1)

mozway

Reputation: 261914

You need to use a loop/list comprehension here:

import re
# pair each row's regex with that row's value; re.search matches anywhere in the string
out = [bool(re.search(pat, s)) for pat, s in zip(series, df['Value'])]

# or as new column in df:
# df['new'] = [bool(re.search(pat, s)) for pat, s in zip(series, df['Value'])]

output: [True, True, False, True, False, True]
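
For reference, here is a self-contained sketch that runs the list comprehension above on the question's sample data, followed by one possible variant that was not part of the original answer. The variant assumes that the number of distinct patterns is small compared to the number of rows and that `series` and `df` share the same index. Under those assumptions it loops over the unique patterns and runs the vectorized `str.contains` once per pattern on the matching rows. Whether it is actually faster depends on the data, so it is worth timing both. The names `rowwise` and `grouped` are only illustrative.

```python
import re

import pandas as pd

df = pd.DataFrame(["a", "a", "b", "c", "de", "de"], columns=["Value"])
series = pd.Series(["a|b|c", "a", "d|e", "c", "c|a", "f|e"])

# Row-wise check, as in the answer: one re.search call per row.
rowwise = [bool(re.search(pat, s)) for pat, s in zip(series, df["Value"])]

# Grouped variant (assumption: few distinct patterns, identical indexes).
# For each unique pattern, apply the vectorized str.contains to the rows
# that use that pattern and write the result back by index label.
grouped = pd.Series(False, index=df.index)
for pat, idx in series.groupby(series).groups.items():
    grouped.loc[idx] = df.loc[idx, "Value"].str.contains(pat, regex=True)

print(rowwise)
print(grouped.tolist())
```

If every row has its own unique pattern, the grouped variant degenerates into one `str.contains` call per row, and the plain list comprehension is likely the better choice.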

Upvotes: 1
