Reputation: 258
I need to split a column into two using regex and str.extract()
(assuming this is best)
df = pd.DataFrame({
'Product': ['Truly Mix 2/12Pk Cans - 12Z',
'Bud 16Z - LOOSE - 16Z',
'Blue Moon (Case 12x - 22Z)',
'2 for the show (6/4PK - 16Z)']
})
I would like this result:
df_result = pd.DataFrame({
'Product': ['Truly Mix', 'Bud', 'Blue Moon', '2 for the show'],
'Packaging': ['2/12Pk Cans - 12Z',
'16Z - LOOSE - 16Z',
'Case 12x - 22Z',
'6/4PK - 16Z' ]
})
I tried a lot of things, but still struggle with regex, even after lots of online learning.
Here is my final attempt at getting the product:
pattern = r'(\D+)[^\w][^(Case][^0-9]'
df['Product'] = df['Product'].str.extract(pattern)
str.replace()
should work fine for getting rid of the parenthesis, just can't get that far.
I'm just not even close after 3 hours.
Upvotes: 1
Views: 1636
Reputation: 627488
You can extract the two parts of each entry into two columns, and then remove (
and )
at the start/end of the string where they are present:
import pandas as pd
df = pd.DataFrame({'Product': ['Truly Mix 2/12Pk Cans - 12Z','Bud 16Z - LOOSE - 16Z','Blue Moon (Case 12x - 22Z)','2 for the show (6/4PK - 16Z)']})
pattern = r'^(.*?)\s*((?:\((?:Case\b)?|\d+(?:/\d+)?[A-Za-z]+\b).*)'
df[['Product', 'Packaging']] = df['Product'].str.extract(pattern, expand=True)
df['Packaging'] = df['Packaging'].str.replace(r'^\((.*)\)$', r'\1', regex=True)
# => >>> print(df['Packaging'])
# 0 2/12Pk Cans - 12Z
# 1 16Z - LOOSE - 16Z
# 2 Case 12x - 22Z
# 3 6/4PK - 16Z
# => >>> print(df['Product'])
# 0 Truly Mix
# 1 Bud
# 2 Blue Moon
# 3 2 for the show
See the regex demo. Regex details:
^
- start of string(.*?)
- Group 1: any zero or more chars other than line break chars as few as possible
-\s*
- zero or more whitespaces((?:\((?:Case\b)?|\d+(?:/\d+)?[A-Za-z]+\b).*)
- Group 2:
(?:\((?:Case\b)?|\d+(?:/\d+)?[A-Za-z]+\b)
- either of
\((?:Case\b)?
- a (
and then an optional whole word Case
|
- or\d+(?:/\d+)?[A-Za-z]+\b
- one or more digits, an optional sequence of /
and one or more digits, and one or more letters (followed with a word boundary).*
- any zero or more chars other than line break chars as many as possibleThe .replace(r'^\((.*)\)$', r'\1', regex=True)
part remove (
and )
at the start and end of the string where they are both present.
Upvotes: 1