Ginjj
Ginjj

Reputation: 13

Splitting Column in Pandas using Regex

My first question... I have a Pandas data-frame with a column 'Description'. The column has a reference and a name which I want to split into two columns. I have the 'Names' in a separate df:

#  Description                                   #  Names
---------------------------------------          ---------------
0  A long walk by Miss D'Bus                     0  Teresa Green
1  A day in the country by Teresa Green          1  Tim Burr
2  Falling Trees by Tim Burr                     2  Miss D'Bus
3  Evergreens by Teresa Green
4  Late for Dinner by Miss D'Bus

I have successfully searched the descriptions to determine if it has a matching name by using a regex string with all the names:

regex = '$|'.join(map(re.escape, df['Names'])) + '$' 
df['Reference'] = df['Description'].str.split(regex, expand=True)

to get

#  Description                                   Reference
-----------------------------------------------------------------------
0  A long walk by Miss D'Bus                     A long walk by
1  A day in the country by Teresa Green          A day in the country by
2  Falling Trees by Tim Burr                     Falling Trees by
3  Evergreens by Teresa Green                    Evergreens by
4  Late for Dinner by Miss D'Bus                 Late for Dinner by

But I want the corresponding (= removed delimiter) Name as an additional column.

It tried adding *? to the regex like this

I tried splitting the 'Description' column by using the 'Reference' column

df['Name'] = df['Description'].str.split(df['Reference'])

I tried slicing the 'Description' column by using the length of the 'Reference' string like

# like: df['Name'] = df['Description'].str[-10:]
df['Name'] = df['Description'].str[-(df['Reference'].str.len()):]

but i get a constant slice length.

Upvotes: 1

Views: 67

Answers (1)

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 626794

You may use Series.str.extract to get both types of information from the original column:

regex = r'^(.*?)\s*({})$'.format('|'.join(map(re.escape, df['Names'])))
df[['Reference','Name']] = df['Description'].str.extract(regex, expand=True)

Output:

>>> df
                            Description                Reference          name
0             A long walk by Miss D'Bus           A long walk by    Miss D'Bus
1  A day in the country by Teresa Green  A day in the country by  Teresa Green
2             Falling Trees by Tim Burr         Falling Trees by      Tim Burr
3            Evergreens by Teresa Green            Evergreens by  Teresa Green
4         Late for Dinner by Miss D'Bus       Late for Dinner by    Miss D'Bus

The regex will look like ^(.*?)\s*(Teresa\ Green|Tim\ Burr|Miss\ D\'Bus)$:

  • ^ - start of string
  • (.*?) - Group 1 ("Reference"): any zero or more chars other than line break chars, as few as possible
  • \s* - 0+ whitespaces
  • (Teresa\ Green|Tim\ Burr|Miss\ D\'Bus) - Group 2 ("Name"): the alternation group with the known names
  • $ - end of string.

Upvotes: 2

Related Questions