Reputation: 115
I need to extract the last part of a string within a series to create new series(columns) in my dataframe. Below is the string column example
Joke_identifier |
---|
Klint De Drunk Enugu 1 |
Klint De Drunk Enugu 2 |
Klint De Drunk Enugu 3 |
Gordons Lagos 1 |
Gordons Lagos 2 |
I tried using this line of code to extract the last part of the string
import re
train_data.Joke_identifier.str.split(r"[a-zA-Z]\s[0-9]\Z", expand=True)
But it returns
Klint De Drunk Enug
I would get a new data frame as shown below
Joke_identifier | Joker | City | Edition |
---|---|---|---|
Klint De Drunk Enugu 1 | Klint De Drunk | Enugu | 1 |
Klint De Drunk Enugu 2 | Klint De Drunk | Enugu | 2 |
Klint De Drunk Enugu 3 | Klint De Drunk | Enugu | 3 |
Gordons Lagos 1 | Gordons | Lagos | 1 |
Gordons Lagos 2 | Gordons | Lagos | 2 |
Please help
Upvotes: 2
Views: 1145
Reputation: 626802
You can use Series.str.extract
:
import pandas as pd
df = pd.DataFrame({'Joke_identifier':['Klint De Drunk Enugu 1', 'Klint De Drunk Enugu 2', 'Klint De Drunk Enugu 3', 'Gordons Lagos 1','Gordons Lagos 2']})
df[['Joker','City','Edition']] = df['Joke_identifier'].str.extract(r"^(.*)\s+(\S+)\s+(\d+)$", expand=True)
df
# Joke_identifier Joker City Edition
# 0 Klint De Drunk Enugu 1 Klint De Drunk Enugu 1
# 1 Klint De Drunk Enugu 2 Klint De Drunk Enugu 2
# 2 Klint De Drunk Enugu 3 Klint De Drunk Enugu 3
# 3 Gordons Lagos 1 Gordons Lagos 1
# 4 Gordons Lagos 2 Gordons Lagos 2
See the regex demo.
The pattern matches
^
- start of string(.*)
- Capturing group 1 (Column "Joker"): any zero or more chars other than line break chars, as many as possible\s+
- one or more whitespaces(\S+)
- Capturing group 2 (Column "City"): any one or more whitespaces\s+
- one or more whitespaces(\d+)
- Capturing group 3 (Column "Edition"): one or more digits$
- end of string.Upvotes: 1