Jay Cheng
Jay Cheng

Reputation: 49

Return pandas column base on country short code with Geopy

I have a dataframe with about 100k rows and I would like to find the coordinates based on the country short code with geopy.

The original dataframe look like this:

index country city
0 CL SANTIAGO
1 BR SAO PAULO
2 BR ITUPEVA

To limit the number of requests I create another dataframe with unique "country" value, then look up location details from Geopy with following code:

from geopy.geocoders import Nominatim    
from geopy.extra.rate_limiter import RateLimiter    
country_list = raw_df.country.unique().tolist()    
short_code_df["location"] = short_code_df["short_code"].apply(geocode,language="en")

Which return me the below df:

| index | short_code |                   location                   |
|-------|------------|----------------------------------------------|
|     0 | CL         | (Chile, (-31.7613365, -71.3187697))          |
|     1 | BR         | (Brazil, (-10.3333333, -53.2))               |
|     2 | US         | (United States, (39.7837304, -100.445882))   |
|     3 | GB         | (United Kingdom, (54.7023545, -3.2765753))   |
|     4 | JP         | (Japan, (36.5748441, 139.2394179))           |
|     5 | CH         | (Switzerland, (46.7985624, 8.2319736))       |
|     6 | CN         | (China, (35.000074, 104.999927))             |
|     7 | HK         | (Hong Kong, China, (22.350627, 114.1849161)) |

What I would like is to return the Country, Lat, Lon in their respective column like below:

| index | short_code |      location        |      lat       |        lon     |
|-------|------------|----------------------|----------------|----------------|
|     0 | CL         | Chile                |    -31.7613365 |   - 71.3187697 |
|     1 | BR         | Brazil               |    -10.3333333 |          -53.2 |
|     2 | US         | United States        |     39.7837304 |     100.445882 |

I tried to slice it with a square bracket at the end but it throw me an error.

I also tried to use the pandas explode function but that was not working too.

Thanks first and appreciate your helps.

Upvotes: 1

Views: 138

Answers (1)

slymore
slymore

Reputation: 780

You can use Series.apply to solve this problem. The approach is slightly different depending on the type of location: does the column contain tuples or strings ?

If location contains tuples (tuple[str, tuple[float, float]):


import pandas as pd

df = pd.DataFrame(
    {
        "short_code": ["CL", "BR", "US"],
        "location": [
        ("Chile", (-31.7613365, -71.3187697)),
        ("Brazil", (-10.3333333, -53.2)),
        ("United States", (39.7837304, -100.445882))
    ]
    }
)

Using apply:


df["country"] = df["location"].apply(lambda x: x[0])
df["lat"] = df["location"].apply(lambda x: x[1][0])
df["lon"] = df["location"].apply(lambda x: x[1][1])

you get: enter image description here

If location is a string

df = pd.DataFrame(
    {
        "short_code": ["CL", "BR", "US"],
        "location": [
        "(Chile, (-31.7613365, -71.3187697))",
        "(Brazil, (-10.3333333, -53.2))",
        "(United States, (39.7837304, -100.445882))"
    ]
    }
)

In this case you can first use regular expression to parse location and transform it as tuples, then apply the same steps as above

import re
import ast


def parse_location_str(loc_str: str):
    pattern = re.compile(r'\((?P<country>.*?),(?P<coord>.*)\)')
    m = pattern.search(loc_str)
    if m is None:
        return None
    country = m.group('country')
    coords = ast.literal_eval(m.group('coord'))
    return (country, coords)

df['loc_parsed'] = df['location'].apply(parse_location_str)
df["country"] = df["loc_parsed"].apply(lambda x: x[0])
df["lat"] = df["loc_parsed"].apply(lambda x: x[1][0])
df["lon"] = df["loc_parsed"].apply(lambda x: x[1][1])

You get: enter image description here

Upvotes: 1

Related Questions