Maethor
Maethor

Reputation: 101

Country Data: Renaming countries according to dict in pandas

Hej!

I am working on some data analysis of country statistics. I now use data from different sources, and as soon has been seen, countries get called differently some times: while the World Bank calls it "United Kingdom and Northern Ireland", the WHO might call it simply "United Kingdom" and mean the same political construct (I am aware of the fact that England, Scotland and Wales are the "countries" not really UK).

I created a dictionary where I took most of the different names to standardise it to World Bank Data. This works like a charm in a list, but I need it in a pandas DataFrame, that I get from pd.read_csv. In example: If I have a very short dictionary

dict = {'US': 'USA'}

how can I translate this within my dataframe (set the column in df.country to the dict.key value)?

Showing it in example:

ID  country   val
1   US        some values

to:

ID  country  val
1   USA      some values

For my list conversion I used the following construct, where listB is the input and output list:

for key in dict:
    listB = [w.replace(key, dict[key]) for w in listB]

Any suggestions how to do this the most easily? Any help would be awesome!

P.S: On a further note, does anyone have an idea how to generate ISO 3166-1 alpha-3 codes (like Germany = GER, Sweden = SWE and so on?). That might be an extension of the question above.

Upvotes: 1

Views: 3179

Answers (2)

victoria55
victoria55

Reputation: 245

You can use the function clean_country() from the library DataPrep to transform a column of country names into ISO 3166-1 alpha-3 codes. Install DataPrep with pip install dataprep.

from dataprep.clean import clean_country
df = pd.DataFrame({"country": ["Germany", "Sweden", "US"]})

df2 = clean_country(df, "country", output_format="alpha-3")
df2
   country country_clean
0  Germany           DEU
1   Sweden           SWE
2       US           USA

Upvotes: 1

jezrael
jezrael

Reputation: 863531

Use replace:

df['country'] = df['country'].replace(dic)

And for ISO 3166-1 alpha-3 check answers.

I think simpliest is download it from here.

If want parse code from wikipedia is possible use this solution or rewrited for DataFrame in python 3:

from bs4 import BeautifulSoup
import requests

url = "http://en.wikipedia.org/wiki/ISO_3166-1"
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")

t = soup.findAll('table', {'class' : 'wikitable sortable'})[1]
L = []
cs = [th.findAll(text=True)[0] for th in t.findAll('th')]

for row in t.findAll("tr")[1:]:
    tds = row.findAll('td')
    raw_cols = [td.findAll(text=True) for td in tds]
    cols = []
    # country field contains differing numbers of elements, due to the flag -- 
    # only take the name
    cols.append(raw_cols[0][-1:][0])
    # for all other columns, use the first result text
    cols.extend([col[0] for col in raw_cols[1:]])
    L.append(cols)

df = pd.DataFrame(L, columns=cs)

print (df.head())
  English short name (upper/lower case) Alpha-2 code Alpha-3 code  \
0                           Afghanistan           AF          AFG   
1                         Åland Islands           AX          ALA   
2                               Albania           AL          ALB   
3                               Algeria           DZ          DZA   
4                        American Samoa           AS          ASM   

  Numeric code       Link to  Independent  
0          004  ISO 3166-2:AF         Yes  
1          248  ISO 3166-2:AX          No  
2          008  ISO 3166-2:AL         Yes  
3          012  ISO 3166-2:DZ         Yes  
4          016  ISO 3166-2:AS          No  

Upvotes: 2

Related Questions