chasedcribbet
chasedcribbet

Reputation: 258

str.extract() with regex

I need to split a column into two using regex and str.extract() (assuming this is best)

    df = pd.DataFrame({
                        'Product': ['Truly Mix 2/12Pk Cans - 12Z',
                                    'Bud 16Z - LOOSE -  16Z',
                                    'Blue Moon (Case 12x - 22Z)',
                                    '2 for the show (6/4PK - 16Z)']
             })

I would like this result:

df_result = pd.DataFrame({
                          'Product': ['Truly Mix', 'Bud', 'Blue Moon', '2 for the show'],
                          'Packaging': ['2/12Pk Cans - 12Z',
                                        '16Z - LOOSE -  16Z',
                                        'Case 12x - 22Z',
                                        '6/4PK - 16Z' ]
                 })

I tried a lot of things, but still struggle with regex, even after lots of online learning.

Here is my final attempt at getting the product:

pattern = r'(\D+)[^\w][^(Case][^0-9]'

df['Product'] = df['Product'].str.extract(pattern)

str.replace() should work fine for getting rid of the parenthesis, just can't get that far.

I'm just not even close after 3 hours.

Upvotes: 1

Views: 1636

Answers (1)

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 627488

You can extract the two parts of each entry into two columns, and then remove ( and ) at the start/end of the string where they are present:

import pandas as pd
df = pd.DataFrame({'Product': ['Truly Mix 2/12Pk Cans - 12Z','Bud 16Z - LOOSE -  16Z','Blue Moon (Case 12x - 22Z)','2 for the show (6/4PK - 16Z)']})
pattern = r'^(.*?)\s*((?:\((?:Case\b)?|\d+(?:/\d+)?[A-Za-z]+\b).*)'
df[['Product', 'Packaging']] = df['Product'].str.extract(pattern, expand=True)
df['Packaging'] = df['Packaging'].str.replace(r'^\((.*)\)$', r'\1', regex=True)
# => >>> print(df['Packaging'])
#    0     2/12Pk Cans - 12Z
#    1    16Z - LOOSE -  16Z
#    2        Case 12x - 22Z
#    3           6/4PK - 16Z
# => >>> print(df['Product'])
#    0         Truly Mix
#    1               Bud
#    2         Blue Moon
#    3    2 for the show

See the regex demo. Regex details:

  • ^ - start of string
  • (.*?) - Group 1: any zero or more chars other than line break chars as few as possible -\s* - zero or more whitespaces
  • ((?:\((?:Case\b)?|\d+(?:/\d+)?[A-Za-z]+\b).*) - Group 2:
    • (?:\((?:Case\b)?|\d+(?:/\d+)?[A-Za-z]+\b) - either of
      • \((?:Case\b)? - a ( and then an optional whole word Case
      • | - or
      • \d+(?:/\d+)?[A-Za-z]+\b - one or more digits, an optional sequence of / and one or more digits, and one or more letters (followed with a word boundary)
    • .* - any zero or more chars other than line break chars as many as possible

The .replace(r'^\((.*)\)$', r'\1', regex=True) part remove ( and ) at the start and end of the string where they are both present.

Upvotes: 1

Related Questions