Reputation: 49
I have a dataframe with about 100k rows and I would like to find the coordinates based on the country short code with geopy.
The original dataframe look like this:
index | country | city |
---|---|---|
0 | CL | SANTIAGO |
1 | BR | SAO PAULO |
2 | BR | ITUPEVA |
To limit the number of requests I create another dataframe with unique "country" value, then look up location details from Geopy with following code:
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter
country_list = raw_df.country.unique().tolist()
short_code_df["location"] = short_code_df["short_code"].apply(geocode,language="en")
Which return me the below df:
| index | short_code | location |
|-------|------------|----------------------------------------------|
| 0 | CL | (Chile, (-31.7613365, -71.3187697)) |
| 1 | BR | (Brazil, (-10.3333333, -53.2)) |
| 2 | US | (United States, (39.7837304, -100.445882)) |
| 3 | GB | (United Kingdom, (54.7023545, -3.2765753)) |
| 4 | JP | (Japan, (36.5748441, 139.2394179)) |
| 5 | CH | (Switzerland, (46.7985624, 8.2319736)) |
| 6 | CN | (China, (35.000074, 104.999927)) |
| 7 | HK | (Hong Kong, China, (22.350627, 114.1849161)) |
What I would like is to return the Country, Lat, Lon in their respective column like below:
| index | short_code | location | lat | lon |
|-------|------------|----------------------|----------------|----------------|
| 0 | CL | Chile | -31.7613365 | - 71.3187697 |
| 1 | BR | Brazil | -10.3333333 | -53.2 |
| 2 | US | United States | 39.7837304 | 100.445882 |
I tried to slice it with a square bracket at the end but it throw me an error.
I also tried to use the pandas explode function but that was not working too.
Thanks first and appreciate your helps.
Upvotes: 1
Views: 138
Reputation: 780
You can use Series.apply
to solve this problem. The approach is slightly different depending on the type of location
: does the column contain tuples or strings ?
location
contains tuples (tuple[str, tuple[float, float]):
import pandas as pd
df = pd.DataFrame(
{
"short_code": ["CL", "BR", "US"],
"location": [
("Chile", (-31.7613365, -71.3187697)),
("Brazil", (-10.3333333, -53.2)),
("United States", (39.7837304, -100.445882))
]
}
)
Using apply:
df["country"] = df["location"].apply(lambda x: x[0])
df["lat"] = df["location"].apply(lambda x: x[1][0])
df["lon"] = df["location"].apply(lambda x: x[1][1])
df = pd.DataFrame(
{
"short_code": ["CL", "BR", "US"],
"location": [
"(Chile, (-31.7613365, -71.3187697))",
"(Brazil, (-10.3333333, -53.2))",
"(United States, (39.7837304, -100.445882))"
]
}
)
In this case you can first use regular expression to parse location
and transform it as tuples, then apply the same steps as above
import re
import ast
def parse_location_str(loc_str: str):
pattern = re.compile(r'\((?P<country>.*?),(?P<coord>.*)\)')
m = pattern.search(loc_str)
if m is None:
return None
country = m.group('country')
coords = ast.literal_eval(m.group('coord'))
return (country, coords)
df['loc_parsed'] = df['location'].apply(parse_location_str)
df["country"] = df["loc_parsed"].apply(lambda x: x[0])
df["lat"] = df["loc_parsed"].apply(lambda x: x[1][0])
df["lon"] = df["loc_parsed"].apply(lambda x: x[1][1])
Upvotes: 1