I'm trying to extract the country name from the following dataframe:
country
0 NaN
1 Country: America
2 Country: France ...More CountriesFranceNorwayP...
3 NaN
4 Country: India
using the following regex:
import re
regex = re.compile(r"Country: (?P<country>\w+)")
df['country'] = df['country'].str.extractall(regex).droplevel(1)
However, it returns
country
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
instead of
country
0 NaN
1 America
2 France
3 NaN
4 India
What am I missing? Please advise.
Upvotes: 2
You can also avoid regex and use Series.str.split:
In [86]: df = pd.DataFrame({'country' : [np.nan, 'Country: America', 'Country: France ... More countries...', np.nan, 'Country: India']})
In [87]: df
Out[87]:
country
0 NaN
1 Country: America
2 Country: France ... More countries...
3 NaN
4 Country: India
In [94]: df.country.str.split(':').str[1].str.split().str[0]
Out[94]:
0 NaN
1 America
2 France
3 NaN
4 India
Name: country, dtype: object
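The chained call above can be read as two steps. The sketch below spells them out on a small Series; the variable names and the extra row without a colon are illustrative assumptions, not part of the original data. It suggests that a row lacking the colon simply ends up as NaN rather than raising.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the last row (no colon) is added only for illustration.
s = pd.Series([
    np.nan,
    "Country: America",
    "Country: France ... More countries...",
    "Country India",
])

# Step 1: split on the first colon only and keep the right-hand part.
# Rows without a colon have no element at position 1, so they become NaN.
after_colon = s.str.split(":", n=1).str[1]

# Step 2: split on whitespace and keep the first token (the country name).
first_word = after_colon.str.split().str[0]

print(first_word)
```

Passing n=1 is a small defensive choice here: it keeps any later colons in the trailing text from producing extra pieces.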
Upvotes: 1
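You can use Series.str.extract, which returns the first match for each row and keeps the original index, so no droplevel is needed: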
df['country'] = df['country'].str.extract(r'Country:\s*(\w+)')
Pandas test:
import pandas as pd
import numpy as np
df = pd.DataFrame({'country' : [np.nan, 'Country: America', 'Country France ... More countries...']})
df['country'].str.extract(r'Country:\s*(\w+)')
# 0
# 0 NaN
# 1 America
# 2 NaN
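As a further sketch, the same idea can be applied to the data from the question, keeping the named group and assigning the result back to the column. The sample strings are copied from the truncated display in the question, so the real cell contents may differ; expand=False is used here so that a single group comes back as a Series rather than a one-column DataFrame.

```python
import numpy as np
import pandas as pd

# Sample rows taken from the question's (truncated) dataframe display.
df = pd.DataFrame({
    "country": [
        np.nan,
        "Country: America",
        "Country: France ...More CountriesFranceNorwayP...",
        np.nan,
        "Country: India",
    ]
})

# With one capture group and expand=False, str.extract returns a Series
# aligned to df's index; rows that do not match stay NaN.
extracted = df["country"].str.extract(r"Country:\s*(?P<country>\w+)", expand=False)

df["country"] = extracted
print(df)
```

Because the result shares the original index, the assignment lines up row by row without any index manipulation.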
Upvotes: 1