Dylan

Reputation: 913

pandas.replace conflict with str.replace regex. Code Order

My task is to remove any content in parentheses and any digits that follow a country name, and to change the names of a couple of countries.

e.g. 'Bolivia (Plurinational State of)' should be 'Bolivia' and 'Switzerland17' should be 'Switzerland'.

My original code was in the order:

dict1 = {
"Republic of Korea": "South Korea",
"United States of America": "United States",
"United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
"China, Hong Kong Special Administrative Region": "Hong Kong"} 

energy['Country'] = energy['Country'].replace(dict1)
energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True)
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True)
energy.loc[energy['Country'] == 'United States']

The str.replace part works fine, and those tasks were completed. But when I use the last line to check whether I successfully changed the country name, the original code turns out not to have worked. However, if I change the order of the code to:

energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True)
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True)
energy['Country'] = energy['Country'].replace(dict1)

Then it successfully changes the country name. So is there something wrong with my regex syntax? Why is this happening, and how do I resolve this conflict?

Upvotes: 1

Views: 1460

Answers (1)

jezrael

Reputation: 863176

The problem is that Series.replace with a dict only replaces values that match a key exactly. To replace substrings you need to pass regex=True:

energy = pd.DataFrame({'Country':['United States of America4',
                                  'United States of America (aaa)','Slovakia']})
print (energy)
                          Country
0       United States of America4
1  United States of America (aaa)
2                        Slovakia

dict1 = {
"Republic of Korea": "South Korea",
"United States of America": "United States",
"United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
"China, Hong Kong Special Administrative Region": "Hong Kong"} 

#no replacement because nothing matches exactly (values still contain digits and parentheses)
energy['Country'] = energy['Country'].replace(dict1)
print (energy)
                          Country
0       United States of America4
1  United States of America (aaa)
2                        Slovakia

energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True)
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True)
print (energy)
                    Country
0  United States of America
1  United States of America
2                  Slovakia

print (energy.loc[energy['Country'] == 'United States'])
Empty DataFrame
Columns: [Country]
Index: []

energy['Country'] = energy['Country'].replace(dict1, regex=True)
print (energy)
               Country
0       United States4
1  United States (aaa)
2             Slovakia

energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True)
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True)
print (energy)
         Country
0  United States
1  United States
2       Slovakia

print (energy.loc[energy['Country'] == 'United States'])
         Country
0  United States
1  United States

#alternatively, starting again from the original data: clean first
energy['Country'] = energy['Country'].str.replace(r' \(.*\)', '', regex=True)
energy['Country'] = energy['Country'].str.replace(r'\d+', '', regex=True)
print (energy)
                    Country
0  United States of America
1  United States of America
2                  Slovakia

#now replace works, because the cleaned values match the dict keys exactly
energy['Country'] = energy['Country'].replace(dict1)
print (energy)
         Country
0  United States
1  United States
2       Slovakia

print (energy.loc[energy['Country'] == 'United States'])
         Country
0  United States
1  United States
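
For reference, the clean-first ordering above could be wrapped up roughly as follows. This is only a sketch: the sample rows, the shortened `renames` dict, and the helper name `clean_country` are made up for illustration and are not from the question's real data. It passes regex=True explicitly to str.replace, since newer pandas versions no longer treat the pattern as a regex by default, and it runs the exact-match replace last so that the dict keys can match whole values.

```python
import pandas as pd

# Hypothetical sample data, not the asker's real `energy` table.
energy = pd.DataFrame({'Country': ['United States of America4',
                                   'Bolivia (Plurinational State of)',
                                   'Switzerland17',
                                   'Republic of Korea']})

# Shortened version of dict1, for illustration only.
renames = {
    "Republic of Korea": "South Korea",
    "United States of America": "United States",
}


def clean_country(names: pd.Series, renames: dict) -> pd.Series:
    """Strip parenthesised text and digits, then rename exact matches."""
    cleaned = (names
               .str.replace(r'\s*\(.*\)', '', regex=True)  # drop " (...)" suffixes
               .str.replace(r'\d+', '', regex=True)        # drop digits
               .str.strip())
    # Whole-value replacement; it runs last so the keys can match exactly.
    return cleaned.replace(renames)


energy['Country'] = clean_country(energy['Country'], renames)
print(energy['Country'].tolist())
```

Doing the renaming last keeps the dict keys as plain strings, so characters such as the comma in "China, Hong Kong Special Administrative Region" never have to be treated as regex syntax.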

Upvotes: 3
