Reputation: 13
My first question... I have a Pandas data-frame with a column 'Description'
. The column has a reference and a name which I want to split into two columns. I have the 'Names'
in a separate df:
# Description # Names
--------------------------------------- ---------------
0 A long walk by Miss D'Bus 0 Teresa Green
1 A day in the country by Teresa Green 1 Tim Burr
2 Falling Trees by Tim Burr 2 Miss D'Bus
3 Evergreens by Teresa Green
4 Late for Dinner by Miss D'Bus
I have successfully searched the descriptions to determine if it has a matching name by using a regex string with all the names:
regex = '$|'.join(map(re.escape, df['Names'])) + '$'
df['Reference'] = df['Description'].str.split(regex, expand=True)
to get
# Description Reference
-----------------------------------------------------------------------
0 A long walk by Miss D'Bus A long walk by
1 A day in the country by Teresa Green A day in the country by
2 Falling Trees by Tim Burr Falling Trees by
3 Evergreens by Teresa Green Evergreens by
4 Late for Dinner by Miss D'Bus Late for Dinner by
But I want the corresponding (= removed delimiter) Name as an additional column.
It tried adding *? to the regex like this
I tried splitting the 'Description' column by using the 'Reference' column
df['Name'] = df['Description'].str.split(df['Reference'])
I tried slicing the 'Description' column by using the length of the 'Reference' string like
# like: df['Name'] = df['Description'].str[-10:]
df['Name'] = df['Description'].str[-(df['Reference'].str.len()):]
but i get a constant slice length.
Upvotes: 1
Views: 67
Reputation: 626794
You may use Series.str.extract
to get both types of information from the original column:
regex = r'^(.*?)\s*({})$'.format('|'.join(map(re.escape, df['Names'])))
df[['Reference','Name']] = df['Description'].str.extract(regex, expand=True)
Output:
>>> df
Description Reference name
0 A long walk by Miss D'Bus A long walk by Miss D'Bus
1 A day in the country by Teresa Green A day in the country by Teresa Green
2 Falling Trees by Tim Burr Falling Trees by Tim Burr
3 Evergreens by Teresa Green Evergreens by Teresa Green
4 Late for Dinner by Miss D'Bus Late for Dinner by Miss D'Bus
The regex will look like ^(.*?)\s*(Teresa\ Green|Tim\ Burr|Miss\ D\'Bus)$
:
^
- start of string(.*?)
- Group 1 ("Reference"): any zero or more chars other than line break chars, as few as possible\s*
- 0+ whitespaces(Teresa\ Green|Tim\ Burr|Miss\ D\'Bus)
- Group 2 ("Name"): the alternation group with the known names$
- end of string.Upvotes: 2