Reputation: 1112
I am trying to clean my df['Country'] variable by creating a new one df['Country Clean'] that takes the value of the country variable if it finds it in the df['Country'] column.
I figured out though that if I repeat my command, I will also delete my previous findings and i will only get a variable reporting the finding for 'Russia'
Is there a way to do this?
data = {'Number':['1', '2', '1', '2', '1', '2'], 'Country':['Italy 1', 'Italie', 'Ecco', 'Russia is in Euroasia' , 'Yugoslavia', 'Russia']}
df = pd.DataFrame(data)
df['Country Clean'] = df['Country'].str.replace(r'(^.*Italy.*$)', 'Italy')
df['Country Clean'] = df['Country'].str.replace(r'(^.*Russia.*$)', 'Russia')
Expected output
data2 = {'Number':['1', '2', '1', '2', '1', '2'], 'Country':['Italy', 'Italy', NaN, 'Russia' , NaN , 'Russia']}
exp = pd.DataFrame(data2)
exp
Upvotes: 1
Views: 513
Reputation: 627083
I suggest first normalizing the country names and then change the Country Clean column values according to the allowed country list:
normalize_countries={"Italie": "Italy", "Rusia": "Russia"} # Spelling corrections
pattern = r"\b(?:{})\b".format("|".join(normalize_countries)) # Regex to find misspellings
countries = ["Italy", "Russia"] # Country list
df['Country Clean'] = df['Country'].str.replace(pattern, lambda x: normalize_countries[x.group()])
def applyFunc(s):
for e in countries:
if e in s:
return e
return 'NaN'
df['Country Clean'] = df['Country Clean'].apply(applyFunc)
Output:
>>> df
Number Country Country Clean
0 1 Italy 1 Italy
1 2 Italie Italy
2 1 Ecco NaN
3 2 Russia is in Euroasia Russia
4 1 Yugoslavia NaN
5 2 Russia Russia
The df['Country'].str.replace(pattern, lambda x: normalize_countries[x.group()])
line searches for all misspelt country names as whole words in the Country
column and replaces them with correct spelling variants.
You may also add whole word check when searching for countries if you use regexps in the countries
list and then use re.search
instead of if e in countries
in the applyFunc
.
Upvotes: 1
Reputation: 5473
Use:
In [15]: countries = ["italy", "russia", "yugoslavia", "italie"]
In [16]: for i in countries:df.loc[lambda x:x.Country.str.lower().str.contains(i), 'Country Clean'] = i.capitalize()
In [17]: df
Out[17]:
Number Country Country Clean
0 1 Italy 1 Italy
1 2 Italie Italie
2 1 Ecco NaN
3 2 Russia is in Euroasia Russia
4 1 Yugoslavia Yugoslavia
5 2 Russia Russia
Upvotes: 0