Crimsoneer

Reputation: 53

Pandas Regex extract giving different output to re.search?

So, I'm trying to use regex to extract weight values from a column of my pandas dataframe... except for some reason, it's not extracting correctly.

all_data["name"].iloc[0] = "220 grams" # this is purely to show my issue

pattern = r"[0-9]+ ?(gram|mg|Gram|GRAM)"

gram_values = all_data["name"].str.contains(pattern)

re.search(pattern, all_data["name"].iloc[0])

Output is

<re.Match object; span=(0, 8), match='220 gram'>

As predicted, it's extracting the 220 gram. Hooray.

NOW, if I use the pandas.str.extract method...

all_data["name"].str.extract(pattern)

Then the output is

extracts "gram"

Same regex pattern, two different outputs. So what the hell am I doing wrong here? How can the regex string extract different values?

Upvotes: 1

Views: 619

Answers (1)

Wiktor Stribiżew

Reputation: 627101

The behavior of Pandas Series.str.extract() is explained in the documentation: it returns only the contents of the capturing groups.

pat : string
Regular expression pattern with capturing groups

Your regex contains a single capturing group, (gram|mg|Gram|GRAM), so its contents are returned.
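To see why the two calls appear to disagree, here is a minimal sketch comparing them side by side. The `names` Series is a hypothetical stand-in for `all_data["name"]`; the pattern is the one from the question. The `re.Match` repr shows the whole match (group 0), while `str.extract` returns group 1.

```python
import re
import pandas as pd

pattern = r"[0-9]+ ?(gram|mg|Gram|GRAM)"
names = pd.Series(["220 grams"])  # hypothetical stand-in for all_data["name"]

m = re.search(pattern, names.iloc[0])
whole_match = m.group(0)  # the full match, which is what the Match repr displays
first_group = m.group(1)  # contents of the only capturing group

# str.extract returns one column per capturing group, not the whole match
extracted = names.str.extract(pattern)

print(whole_match)           # 220 gram
print(first_group)           # gram
print(extracted.iloc[0, 0])  # gram
```

So both tools run the same regex; they just report different parts of the match.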

To make the regex work in Pandas str.extract, wrap it with a capturing group, and make the other group non-capturing:

r"([0-9]+ ?(?:gram|mg|Gram|GRAM))"
# |        |non-capturing group||
# |_______ capturing group______|
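As a quick check, the sketch below applies the wrapped pattern to a small Series. The sample values and the names `names`, `fixed` and `weights` are made up for illustration. Passing `expand=False` is optional; with a single capturing group it returns a Series instead of a one-column DataFrame.

```python
import pandas as pd

# Hypothetical sample data standing in for all_data["name"]
names = pd.Series(["220 grams", "15mg", "no weight here"])

# Outer group captures the whole value; inner alternation is non-capturing
fixed = r"([0-9]+ ?(?:gram|mg|Gram|GRAM))"

weights = names.str.extract(fixed, expand=False)
print(weights.tolist())  # ['220 gram', '15mg', nan]
```

Rows with no match come back as NaN, so they can be dropped with `dropna()` if needed.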

Upvotes: 2
