Shen
Shen

Reputation: 3

How can I use Python to turn country code into full name and infer the country name based on the city name on an Excel file?

I'm a beginner in Python.

Now I have 2 columns on my Excel file. One is country column and the other one is city column.

For the country column, most of the values are shown in country code and some of them are shown in country full name, while some values are U.S.A states code and less than 1% of them are blank.

For the city column, it clearly shows the full city name (not city code), while nearly 20% of them are blank.

How can I use Python to create a new column to show the full country name based on the country code and remain the same name if it shows the full country name in the country column, and show the U.S.A states code as the United States in the new column?

The tricky part is, in the country column, take CO as the example, Co can stand for Columbia and Colorado, I cannot be sure whether it's a country or a state at the beginning, but when I check the corresponding city name I can know it's a country or a state (ex: Longmont for Colorado, Bogota for Columbia). How can I avoid this issue in the new column and infer the full country name in the new column based on the corresponding city name?

I appreciate your help!

Upvotes: -2

Views: 1821

Answers (3)

DarrylG
DarrylG

Reputation: 17166

Explanation

Coded the task using following logic.

  1. Process simple abbreviations such as U.S.
  2. A country length greater than 3
    1. Have Country and City
      • Find closest Country City Pair in Cities
    2. Country Only
      • Find closest country match in list of countries in two letter country codes
  3. Country length equals 3
    • Find country with 3 letter country codes
  4. Country length equals 2 (could be country or state code)
  5. Code does not exist in list of states
    • Must be country code, so look up country in two letter country codes
  6. Code does not exist in list of countries
    • Must be state code for USA, so country is United States
  7. Could be country or state code
    • Check if city with this as a state code
    • Check if city with this as a country code
    • Must be best match of these two possibilities

Note: String matching uses fuzzy matching to allow for flexibility in spelling of names rapidfuzz library was used over fuzzywuzzy since its an order of magnitude faster

Code

import pandas as pd
from rapidfuzz import fuzz

def find_closest_country(country):
    ' Country with the closest name in list of countries in country code '
    ratios = [fuzz.partial_ratio(country, x) for x in alpha2.values()]
    rated_countries = [(info, r) for info, r in zip(alpha2.values(), ratios)]
    
    # Best match with shortest name
    return sorted(rated_countries, key = lambda x: (x[1], -len(x[0])), reverse = True)[0]
    
def check_city_country(city, country):
    ' City, Country pair closest in list of cities '
    ratios = [fuzz.partial_ratio(city, x['name']) * fuzz.partial_ratio(country, x['country']) for x in cities]
    rated_cities = [(info, r) for info, r in zip(cities, ratios)]
    
    # Best match with shortest name
    return sorted(rated_cities, key = lambda x: (x[1], -len(x[0])), reverse = True)[0]
    
def check_city_subregion(city, subregion):
    ' City, subresion pair closest in list of cities '
    ratios = [fuzz.partial_ratio(city, x['name']) * fuzz.partial_ratio(subregion, x['subcountry']) for x in cities]
    rated_cities = [(info, r) for info, r in zip(cities, ratios)]
    
    # Best match with shortest name
    return sorted(rated_cities, key = lambda x: (x[1], -len(x[0])), reverse = True)[0]
    
def lookup(country, city):
    '''
        Finds country based upon country and city
        country - country name or country code
        city - name of city
    '''
    if country.lower() == 'u.s.':
        # Picks up common US acronym
        country = "US"
   
    if len(country) > 3:
        # Must be country since too long for abbreviation
        if city:
            # Find closest city country pair in list of cities
            city_info = check_city_country(city, country)
            if city_info:
                return city_info[0]['country']
       
        # No city, so find closest country in list of countries (2 code abbreviations reverse lookup)
        countries = find_closest_country(country)
        if countries:
            return countries[0]
        
        return None
    elif len(country) == 3:
        # 3 letter abbreviation
        country = country.upper()
        return alpha3.get(country, None)
    
    elif len(country) == 2:
        # Two letter country abbreviation
        country = country.upper()
        if not country in states:
            # Not a state code, so lookup contry from code
            return alpha2.get(country, None)
        
        if not country in alpha2:
            # Not a country code, so must be state code for US
            return "United States of America"
        
        # Could be country of state code
        
        if city:
            # Have 2 digit code (could be country or state)
            pos_country = alpha2[country]  # possible country
            pos_state = states[country]    # possible state
            
            # check closest country with this city
            pos_countries = check_city_country(city, pos_country)
            
            # If state code, country would be United States
            pos_us = check_city_country(city, "United States")
            
            if pos_countries[1] > pos_us[1]:
                # Provided better match as country code
                return pos_countries[0]['country']
            else:
                # Provided better match as state code (i.e. "United States")
                return pos_us[0]['country']
        else:
            return alpha2[country]
             
    else:
        return None
   

Data

# State Codes
# https://gist.github.com/rugbyprof/76575b470b6772ce8fa0c49e23931d97
states = {"AL":"Alabama","AK":"Alaska","AZ":"Arizona","AR":"Arkansas","CA":"California","CO":"Colorado","CT":"Connecticut","DE":"Delaware","FL":"Florida","GA":"Georgia","HI":"Hawaii","ID":"Idaho","IL":"Illinois","IN":"Indiana","IA":"Iowa","KS":"Kansas","KY":"Kentucky","LA":"Louisiana","ME":"Maine","MD":"Maryland","MA":"Massachusetts","MI":"Michigan","MN":"Minnesota","MS":"Mississippi","MO":"Missouri","MT":"Montana","NE":"Nebraska","NV":"Nevada","NH":"New Hampshire","NJ":"New Jersey","NM":"New Mexico","NY":"New York","NC":"North Carolina","ND":"North Dakota","OH":"Ohio","OK":"Oklahoma","OR":"Oregon","PA":"Pennsylvania","RI":"Rhode Island","SC":"South Carolina","SD":"South Dakota","TN":"Tennessee","TX":"Texas","UT":"Utah","VT":"Vermont","VA":"Virginia","WA":"Washington","WV":"West Virginia","WI":"Wisconsin","WY":"Wyoming"}

# two letter country codes
# https://gist.github.com/carlopires/1261951/d13ca7320a6abcd4b0aa800d351a31b54cefdff4
alpha2 = {
    'AD': 'Andorra',
    'AE': 'United Arab Emirates',
    'AF': 'Afghanistan',
    'AG': 'Antigua & Barbuda',
    'AI': 'Anguilla',
    'AL': 'Albania',
    'AM': 'Armenia',
    'AN': 'Netherlands Antilles',
    'AO': 'Angola',
    'AQ': 'Antarctica',
    'AR': 'Argentina',
    'AS': 'American Samoa',
    'AT': 'Austria',
    'AU': 'Australia',
    'AW': 'Aruba',
    'AZ': 'Azerbaijan',
    'BA': 'Bosnia and Herzegovina',
    'BB': 'Barbados',
    'BD': 'Bangladesh',
    'BE': 'Belgium',
    'BF': 'Burkina Faso',
    'BG': 'Bulgaria',
    'BH': 'Bahrain',
    'BI': 'Burundi',
    'BJ': 'Benin',
    'BM': 'Bermuda',
    'BN': 'Brunei Darussalam',
    'BO': 'Bolivia',
    'BR': 'Brazil',
    'BS': 'Bahama',
    'BT': 'Bhutan',
    'BU': 'Burma (no longer exists)',
    'BV': 'Bouvet Island',
    'BW': 'Botswana',
    'BY': 'Belarus',
    'BZ': 'Belize',
    'CA': 'Canada',
    'CC': 'Cocos (Keeling) Islands',
    'CF': 'Central African Republic',
    'CG': 'Congo',
    'CH': 'Switzerland',
    'CI': 'Côte D\'ivoire (Ivory Coast)',
    'CK': 'Cook Iislands',
    'CL': 'Chile',
    'CM': 'Cameroon',
    'CN': 'China',
    'CO': 'Colombia',
    'CR': 'Costa Rica',
    'CS': 'Czechoslovakia (no longer exists)',
    'CU': 'Cuba',
    'CV': 'Cape Verde',
    'CX': 'Christmas Island',
    'CY': 'Cyprus',
    'CZ': 'Czech Republic',
    'DD': 'German Democratic Republic (no longer exists)',
    'DE': 'Germany',
    'DJ': 'Djibouti',
    'DK': 'Denmark',
    'DM': 'Dominica',
    'DO': 'Dominican Republic',
    'DZ': 'Algeria',
    'EC': 'Ecuador',
    'EE': 'Estonia',
    'EG': 'Egypt',
    'EH': 'Western Sahara',
    'ER': 'Eritrea',
    'ES': 'Spain',
    'ET': 'Ethiopia',
    'FI': 'Finland',
    'FJ': 'Fiji',
    'FK': 'Falkland Islands (Malvinas)',
    'FM': 'Micronesia',
    'FO': 'Faroe Islands',
    'FR': 'France',
    'FX': 'France, Metropolitan',
    'GA': 'Gabon',
    'GB': 'United Kingdom (Great Britain)',
    'GD': 'Grenada',
    'GE': 'Georgia',
    'GF': 'French Guiana',
    'GH': 'Ghana',
    'GI': 'Gibraltar',
    'GL': 'Greenland',
    'GM': 'Gambia',
    'GN': 'Guinea',
    'GP': 'Guadeloupe',
    'GQ': 'Equatorial Guinea',
    'GR': 'Greece',
    'GS': 'South Georgia and the South Sandwich Islands',
    'GT': 'Guatemala',
    'GU': 'Guam',
    'GW': 'Guinea-Bissau',
    'GY': 'Guyana',
    'HK': 'Hong Kong',
    'HM': 'Heard & McDonald Islands',
    'HN': 'Honduras',
    'HR': 'Croatia',
    'HT': 'Haiti',
    'HU': 'Hungary',
    'ID': 'Indonesia',
    'IE': 'Ireland',
    'IL': 'Israel',
    'IN': 'India',
    'IO': 'British Indian Ocean Territory',
    'IQ': 'Iraq',
    'IR': 'Islamic Republic of Iran',
    'IS': 'Iceland',
    'IT': 'Italy',
    'JM': 'Jamaica',
    'JO': 'Jordan',
    'JP': 'Japan',
    'KE': 'Kenya',
    'KG': 'Kyrgyzstan',
    'KH': 'Cambodia',
    'KI': 'Kiribati',
    'KM': 'Comoros',
    'KN': 'St. Kitts and Nevis',
    'KP': 'Korea, Democratic People\'s Republic of',
    'KR': 'Korea, Republic of',
    'KW': 'Kuwait',
    'KY': 'Cayman Islands',
    'KZ': 'Kazakhstan',
    'LA': 'Lao People\'s Democratic Republic',
    'LB': 'Lebanon',
    'LC': 'Saint Lucia',
    'LI': 'Liechtenstein',
    'LK': 'Sri Lanka',
    'LR': 'Liberia',
    'LS': 'Lesotho',
    'LT': 'Lithuania',
    'LU': 'Luxembourg',
    'LV': 'Latvia',
    'LY': 'Libyan Arab Jamahiriya',
    'MA': 'Morocco',
    'MC': 'Monaco',
    'MD': 'Moldova, Republic of',
    'MG': 'Madagascar',
    'MH': 'Marshall Islands',
    'ML': 'Mali',
    'MN': 'Mongolia',
    'MM': 'Myanmar',
    'MO': 'Macau',
    'MP': 'Northern Mariana Islands',
    'MQ': 'Martinique',
    'MR': 'Mauritania',
    'MS': 'Monserrat',
    'MT': 'Malta',
    'MU': 'Mauritius',
    'MV': 'Maldives',
    'MW': 'Malawi',
    'MX': 'Mexico',
    'MY': 'Malaysia',
    'MZ': 'Mozambique',
    'NA': 'Namibia',
    'NC': 'New Caledonia',
    'NE': 'Niger',
    'NF': 'Norfolk Island',
    'NG': 'Nigeria',
    'NI': 'Nicaragua',
    'NL': 'Netherlands',
    'NO': 'Norway',
    'NP': 'Nepal',
    'NR': 'Nauru',
    'NT': 'Neutral Zone (no longer exists)',
    'NU': 'Niue',
    'NZ': 'New Zealand',
    'OM': 'Oman',
    'PA': 'Panama',
    'PE': 'Peru',
    'PF': 'French Polynesia',
    'PG': 'Papua New Guinea',
    'PH': 'Philippines',
    'PK': 'Pakistan',
    'PL': 'Poland',
    'PM': 'St. Pierre & Miquelon',
    'PN': 'Pitcairn',
    'PR': 'Puerto Rico',
    'PT': 'Portugal',
    'PW': 'Palau',
    'PY': 'Paraguay',
    'QA': 'Qatar',
    'RE': 'Réunion',
    'RO': 'Romania',
    'RU': 'Russian Federation',
    'RW': 'Rwanda',
    'SA': 'Saudi Arabia',
    'SB': 'Solomon Islands',
    'SC': 'Seychelles',
    'SD': 'Sudan',
    'SE': 'Sweden',
    'SG': 'Singapore',
    'SH': 'St. Helena',
    'SI': 'Slovenia',
    'SJ': 'Svalbard & Jan Mayen Islands',
    'SK': 'Slovakia',
    'SL': 'Sierra Leone',
    'SM': 'San Marino',
    'SN': 'Senegal',
    'SO': 'Somalia',
    'SR': 'Suriname',
    'ST': 'Sao Tome & Principe',
    'SU': 'Union of Soviet Socialist Republics (no longer exists)',
    'SV': 'El Salvador',
    'SY': 'Syrian Arab Republic',
    'SZ': 'Swaziland',
    'TC': 'Turks & Caicos Islands',
    'TD': 'Chad',
    'TF': 'French Southern Territories',
    'TG': 'Togo',
    'TH': 'Thailand',
    'TJ': 'Tajikistan',
    'TK': 'Tokelau',
    'TM': 'Turkmenistan',
    'TN': 'Tunisia',
    'TO': 'Tonga',
    'TP': 'East Timor',
    'TR': 'Turkey',
    'TT': 'Trinidad & Tobago',
    'TV': 'Tuvalu',
    'TW': 'Taiwan, Province of China',
    'TZ': 'Tanzania, United Republic of',
    'UA': 'Ukraine',
    'UG': 'Uganda',
    'UM': 'United States Minor Outlying Islands',
    'US': 'United States of America',
    'UY': 'Uruguay',
    'UZ': 'Uzbekistan',
    'VA': 'Vatican City State (Holy See)',
    'VC': 'St. Vincent & the Grenadines',
    'VE': 'Venezuela',
    'VG': 'British Virgin Islands',
    'VI': 'United States Virgin Islands',
    'VN': 'Viet Nam',
    'VU': 'Vanuatu',
    'WF': 'Wallis & Futuna Islands',
    'WS': 'Samoa',
    'YD': 'Democratic Yemen (no longer exists)',
    'YE': 'Yemen',
    'YT': 'Mayotte',
    'YU': 'Yugoslavia',
    'ZA': 'South Africa',
    'ZM': 'Zambia',
    'ZR': 'Zaire',
    'ZW': 'Zimbabwe',
    'ZZ': 'Unknown or unspecified country',
}

# Three letter codes
#https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3#Uses_and_applications
alpha3 = """ABW  Aruba
AFG  Afghanistan
AGO  Angola
AIA  Anguilla
ALA  Åland Islands
ALB  Albania
AND  Andorra
ARE  United Arab Emirates
ARG  Argentina
ARM  Armenia
ASM  American Samoa
ATA  Antarctica
ATF  French Southern Territories
ATG  Antigua and Barbuda
AUS  Australia
AUT  Austria
AZE  Azerbaijan
BDI  Burundi
BEL  Belgium
BEN  Benin
BES  Bonaire, Sint Eustatius and Saba
BFA  Burkina Faso
BGD  Bangladesh
BGR  Bulgaria
BHR  Bahrain
BHS  Bahamas
BIH  Bosnia and Herzegovina
BLM  Saint Barthélemy
BLR  Belarus
BLZ  Belize
BMU  Bermuda
BOL  Bolivia (Plurinational State of)
BRA  Brazil
BRB  Barbados
BRN  Brunei Darussalam
BTN  Bhutan
BVT  Bouvet Island
BWA  Botswana
CAF  Central African Republic
CAN  Canada
CCK  Cocos (Keeling) Islands
CHE  Switzerland
CHL  Chile
CHN  China
CIV  Côte d'Ivoire
CMR  Cameroon
COD  Congo, Democratic Republic of the
COG  Congo
COK  Cook Islands
COL  Colombia
COM  Comoros
CPV  Cabo Verde
CRI  Costa Rica
CUB  Cuba
CUW  Curaçao
CXR  Christmas Island
CYM  Cayman Islands
CYP  Cyprus
CZE  Czechia
DEU  Germany
DJI  Djibouti
DMA  Dominica
DNK  Denmark
DOM  Dominican Republic
DZA  Algeria
ECU  Ecuador
EGY  Egypt
ERI  Eritrea
ESH  Western Sahara
ESP  Spain
EST  Estonia
ETH  Ethiopia
FIN  Finland
FJI  Fiji
FLK  Falkland Islands (Malvinas)
FRA  France
FRO  Faroe Islands
FSM  Micronesia (Federated States of)
GAB  Gabon
GBR  United Kingdom of Great Britain and Northern Ireland
GEO  Georgia
GGY  Guernsey
GHA  Ghana
GIB  Gibraltar
GIN  Guinea
GLP  Guadeloupe
GMB  Gambia
GNB  Guinea-Bissau
GNQ  Equatorial Guinea
GRC  Greece
GRD  Grenada
GRL  Greenland
GTM  Guatemala
GUF  French Guiana
GUM  Guam
GUY  Guyana
HKG  Hong Kong
HMD  Heard Island and McDonald Islands
HND  Honduras
HRV  Croatia
HTI  Haiti
HUN  Hungary
IDN  Indonesia
IMN  Isle of Man
IND  India
IOT  British Indian Ocean Territory
IRL  Ireland
IRN  Iran (Islamic Republic of)
IRQ  Iraq
ISL  Iceland
ISR  Israel
ITA  Italy
JAM  Jamaica
JEY  Jersey
JOR  Jordan
JPN  Japan
KAZ  Kazakhstan
KEN  Kenya
KGZ  Kyrgyzstan
KHM  Cambodia
KIR  Kiribati
KNA  Saint Kitts and Nevis
KOR  Korea, Republic of
KWT  Kuwait
LAO  Lao People's Democratic Republic
LBN  Lebanon
LBR  Liberia
LBY  Libya
LCA  Saint Lucia
LIE  Liechtenstein
LKA  Sri Lanka
LSO  Lesotho
LTU  Lithuania
LUX  Luxembourg
LVA  Latvia
MAC  Macao
MAF  Saint Martin (French part)
MAR  Morocco
MCO  Monaco
MDA  Moldova, Republic of
MDG  Madagascar
MDV  Maldives
MEX  Mexico
MHL  Marshall Islands
MKD  North Macedonia
MLI  Mali
MLT  Malta
MMR  Myanmar
MNE  Montenegro
MNG  Mongolia
MNP  Northern Mariana Islands
MOZ  Mozambique
MRT  Mauritania
MSR  Montserrat
MTQ  Martinique
MUS  Mauritius
MWI  Malawi
MYS  Malaysia
MYT  Mayotte
NAM  Namibia
NCL  New Caledonia
NER  Niger
NFK  Norfolk Island
NGA  Nigeria
NIC  Nicaragua
NIU  Niue
NLD  Netherlands
NOR  Norway
NPL  Nepal
NRU  Nauru
NZL  New Zealand
OMN  Oman
PAK  Pakistan
PAN  Panama
PCN  Pitcairn
PER  Peru
PHL  Philippines
PLW  Palau
PNG  Papua New Guinea
POL  Poland
PRI  Puerto Rico
PRK  Korea (Democratic People's Republic of)
PRT  Portugal
PRY  Paraguay
PSE  Palestine, State of
PYF  French Polynesia
QAT  Qatar
REU  Réunion
ROU  Romania
RUS  Russian Federation
RWA  Rwanda
SAU  Saudi Arabia
SDN  Sudan
SEN  Senegal
SGP  Singapore
SGS  South Georgia and the South Sandwich Islands
SHN  Saint Helena, Ascension and Tristan da Cunha
SJM  Svalbard and Jan Mayen
SLB  Solomon Islands
SLE  Sierra Leone
SLV  El Salvador
SMR  San Marino
SOM  Somalia
SPM  Saint Pierre and Miquelon
SRB  Serbia
SSD  South Sudan
STP  Sao Tome and Principe
SUR  Suriname
SVK  Slovakia
SVN  Slovenia
SWE  Sweden
SWZ  Eswatini
SXM  Sint Maarten (Dutch part)
SYC  Seychelles
SYR  Syrian Arab Republic
TCA  Turks and Caicos Islands
TCD  Chad
TGO  Togo
THA  Thailand
TJK  Tajikistan
TKL  Tokelau
TKM  Turkmenistan
TLS  Timor-Leste
TON  Tonga
TTO  Trinidad and Tobago
TUN  Tunisia
TUR  Turkey
TUV  Tuvalu
TWN  Taiwan, Province of China
TZA  Tanzania, United Republic of
UGA  Uganda
UKR  Ukraine
UMI  United States Minor Outlying Islands
URY  Uruguay
USA  United States of America
UZB  Uzbekistan
VAT  Holy See
VCT  Saint Vincent and the Grenadines
VEN  Venezuela (Bolivarian Republic of)
VGB  Virgin Islands (British)
VIR  Virgin Islands (U.S.)
VNM  Viet Nam
VUT  Vanuatu
WLF  Wallis and Futuna
WSM  Samoa
YEM  Yemen
ZAF  South Africa
ZMB  Zambia
ZWE  Zimbabwe"""

# Convert to dictionary
alpha3 = dict(tuple(re.split(r" {2,}", s)) for s in alpha3.split('\n'))

# List of World Cities & Country
# cities https://pkgstore.datahub.io/core/world-cities/world-cities_csv/data/6cc66692f0e82b18216a48443b6b95da/world-cities_csv.csv
# Online CSV File

import csv
import urllib.request
import io

def csv_import(url):
    url_open = urllib.request.urlopen(url)
    csvfile = csv.DictReader(io.StringIO(url_open.read().decode('utf-8')), delimiter=',') 
    return csvfile

url = 'https://pkgstore.datahub.io/core/world-cities/world-cities_csv/data/6cc66692f0e82b18216a48443b6b95da/world-cities_csv.csv'

cities = csv_import(url)

Test

Excel File (Input)

country city
u.s.    
DZ  
AS  
co  Longmont
co  Bogota
AL  
AL  Huntsville
usa 
AFG 
BLR Minsk
AUS 
united states   
Korea   seoul
Korea   Pyongyang

Test Code

df = pd.read_excel('country_test.xlsx') # Load Excel File
df.fillna('', inplace=True)

# Get name of country based upon country and city
df['country_'] = df.apply(lambda row: lookup(row['country'], row['city']), axis = 1)

Resulting Dataframe

       country        city                  country_
0            u.s.              United States of America
1              DZ                               Algeria
2              AS                        American Samoa
3              co    Longmont             United States
4              co      Bogota                  Colombia
5              AL                               Albania
6              AL  Huntsville             United States
7             usa              United States of America
8             AFG                           Afghanistan
9             BLR       Minsk                   Belarus
10            AUS                             Australia
11  united states              United States of America
12          Korea       seoul               South Korea
13          Korea   Pyongyang               North Korea

Upvotes: 0

alexiao
alexiao

Reputation: 36

An advice for this approach is to create dictionaries(i.e. dic = {'CO':'Colombia',...} and dic_state = {'CO':'Colorado',...}). Then, probably have an if statement to check if the country is USA. If USA, then use dic_state. Finally, you can create a new column by using the appropriate command (this depends on the package/module that you are using)

Good luck!

Upvotes: 0

DumbLoawai
DumbLoawai

Reputation: 11

Well, You can have a {key (state) : Values (cities belonging to states)} json and use python to read the file and arrange the list to the corresponding city, state.

Upvotes: 0

Related Questions