Reputation: 913
My task is to remove any content in parentheses and any digits that follow a country name, and to rename a couple of countries.
e.g. 'Bolivia (Plurinational State of)' should be 'Bolivia', and 'Switzerland17' should be 'Switzerland'.
My original code was in the order:
dict1 = {
"Republic of Korea": "South Korea",
"United States of America": "United States",
"United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
"China, Hong Kong Special Administrative Region": "Hong Kong"}
energy['Country'] = energy['Country'].replace(dict1)
energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True)
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True)
energy.loc[energy['Country'] == 'United States']
The str.replace part works fine, and the cleaning tasks were completed.
However, when I use the last line to check whether the country name was changed, the original code doesn't work. If I change the order of the code to:
energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True)
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True)
energy['Country'] = energy['Country'].replace(dict1)
Then it successfully changes the country name. So I assume there is something wrong with my regex syntax. Why is this happening, and how can I solve this conflict?
Upvotes: 1
Views: 1460
Reputation: 863176
The problem is that replace with a dict only matches whole values by default, so you need regex=True in replace to replace substrings:
energy = pd.DataFrame({'Country':['United States of America4',
'United States of America (aaa)','Slovakia']})
print (energy)
Country
0 United States of America4
1 United States of America (aaa)
2 Slovakia
dict1 = {
"Republic of Korea": "South Korea",
"United States of America": "United States",
"United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
"China, Hong Kong Special Administrative Region": "Hong Kong"}
#nothing is replaced because no value matches a key exactly (trailing numbers and parentheses)
energy['Country'] = energy['Country'].replace(dict1)
print (energy)
Country
0 United States of America4
1 United States of America (aaa)
2 Slovakia
energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True)
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True)
print (energy)
Country
0 United States of America
1 United States of America
2 Slovakia
print (energy.loc[energy['Country'] == 'United States'])
Empty DataFrame
Columns: [Country]
Index: []
#solution 1: start again from the original DataFrame, regex=True replaces substrings
energy['Country'] = energy['Country'].replace(dict1, regex=True)
print (energy)
Country
0 United States4
1 United States (aaa)
2 Slovakia
energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True)
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True)
print (energy)
Country
0 United States
1 United States
2 Slovakia
print (energy.loc[energy['Country'] == 'United States'])
Country
0 United States
1 United States
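The difference between the two matching modes can be shown in isolation. This is only a minimal sketch: the Series `s` and the one-entry `mapping` are made up for illustration, and escaping the keys with `re.escape` is my own precaution (not part of the original code), because with regex=True every dict key is interpreted as a regular expression.

```python
import re
import pandas as pd

# Illustrative data, not taken from the original energy file.
s = pd.Series(["United States of America4", "United States of America"])
mapping = {"United States of America": "United States"}

# Default behaviour: a key must equal the entire cell value to be replaced.
exact = s.replace(mapping)

# regex=True: keys are treated as regular expressions and matched as substrings.
# re.escape guards against keys that contain regex metacharacters.
escaped = {re.escape(key): value for key, value in mapping.items()}
substring = s.replace(escaped, regex=True)

print(exact.tolist())      # ['United States of America4', 'United States']
print(substring.tolist())  # ['United States4', 'United States']
```

The first row is untouched by the default call because of the trailing digit, which is exactly what happened in the original order of operations.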
#solution 2: start again from the original DataFrame and clean the data first
energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True)
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True)
print (energy)
Country
0 United States of America
1 United States of America
2 Slovakia
#replace now works because the cleaned values match the dict keys exactly
energy['Country'] = energy['Country'].replace(dict1)
print (energy)
Country
0 United States
1 United States
2 Slovakia
print (energy.loc[energy['Country'] == 'United States'])
Country
0 United States
1 United States
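If you prefer the clean-first order, the steps can be wrapped in one function so the order cannot be mixed up again. This is a sketch under some assumptions: `clean_country` is a hypothetical helper name, the sample rows are invented, and the final `.str.strip()` is an extra step I added to remove stray whitespace. Passing regex=True explicitly to `str.replace` avoids depending on the default, which differs between pandas versions.

```python
import pandas as pd


def clean_country(names: pd.Series, renames: dict) -> pd.Series:
    """Hypothetical helper: strip annotations first, then rename exact matches."""
    cleaned = (
        names.str.replace(r" \(.*\)", "", regex=True)  # drop text in parentheses
        .str.replace(r"\d+", "", regex=True)           # drop footnote digits
        .str.strip()                                   # extra step: trim whitespace
    )
    # Whole-value matching is enough here because the values are already clean.
    return cleaned.replace(renames)


# Invented sample rows for illustration.
energy = pd.DataFrame({"Country": [
    "United States of America4",
    "Bolivia (Plurinational State of)",
    "Switzerland17",
    "Slovakia",
]})
renames = {"United States of America": "United States"}

energy["Country"] = clean_country(energy["Country"], renames)
print(energy["Country"].tolist())
# ['United States', 'Bolivia', 'Switzerland', 'Slovakia']
```

Cleaning before renaming keeps the dict keys as plain strings, so none of them needs to be written or escaped as a regular expression.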
Upvotes: 3