Reputation: 1015
In my pandas dataframe, I have a column which contains user location. I have created a function to identify the country from the location and I want to create a new column with the country name. The function is:
import numpy as np
import pycountry
from geopy.geocoders import Nominatim

geolocator = Nominatim()

def do_fuzzy_search(location):
    if type(location) == float and np.isnan(location):
        return np.nan
    else:
        try:
            result = pycountry.countries.search_fuzzy(location)
        except Exception:
            try:
                loc = geolocator.geocode(str(location))
                return loc.raw['display_name'].split(', ')[-1]
            except Exception:
                return np.nan
        else:
            return result[0].name
On passing a location name, the function returns the name of the country. For example,
do_fuzzy_search("Bombay")
returns 'India'.
I simply want to create a new column using the apply function:
df['country'] = df.user_location.apply(lambda row: do_fuzzy_search(row) if (pd.notnull(row)) else row)
But it's taking forever to run. I have tried a few techniques mentioned in other questions posted on Stack Overflow and in blog posts on the same theme, like Performance of Pandas apply vs np.vectorize, Optimizing Pandas Code for Speed, Speed up pandas using Dask or Swifter, and Speed up pandas using cuDF.
The time taken to execute just the first 10 rows of the column using the various techniques is as follows:
%%time
attractions.User_loc[:10].apply(lambda row: do_fuzzy_search(row) if (pd.notnull(row)) else row)
CPU times: user 27 ms, sys: 1.18 ms, total: 28.2 ms
Wall time: 6.59 s
0 United States of America
1 NaN
2 Australia
3 India
4 NaN
5 Australia
6 India
7 India
8 United Kingdom
9 Singapore
Name: User_loc, dtype: object
Using the Swifter library:
%%time
attractions.User_loc[:10].swifter.apply(lambda row: do_fuzzy_search(row) if (pd.notnull(row)) else row)
CPU times: user 1.03 s, sys: 17.9 ms, total: 1.04 s
Wall time: 7.94 s
0 United States of America
1 NaN
2 Australia
3 India
4 NaN
5 Australia
6 India
7 India
8 United Kingdom
9 Singapore
Name: User_loc, dtype: object
Using np.vectorize:
%%time
np.vectorize(do_fuzzy_search)(attractions['User_loc'][:10])
CPU times: user 34.3 ms, sys: 3.13 ms, total: 37.4 ms
Wall time: 9.05 s
array(['United States of America', 'Italia', 'Australia', 'India',
'Italia', 'Australia', 'India', 'India', 'United Kingdom',
'Singapore'], dtype='<U24')
I also used Dask's map_partitions, which did not give much of a performance gain over the apply function.
import dask.dataframe as dd
import multiprocessing
dd.from_pandas(attractions.User_loc, npartitions=4*multiprocessing.cpu_count())\
.map_partitions(lambda df: df.apply(lambda row: do_fuzzy_search(row) if (pd.notnull(row)) else row)).compute(scheduler='processes')
The computation time for 10 rows is more than 5 seconds with each technique, so it's taking forever for 100k rows. I also tried to use cuDF but that crashes my Colab notebook.
What can I do to improve the performance and achieve the result in a reasonable time?
Upvotes: 3
Views: 638
Reputation: 15568
This looks like an IO-bound rather than a CPU-bound issue, so multiprocessing would not help. The major bottleneck is your call to Nominatim(): you make an HTTP request to their API for every non-NaN row. This means that if 'India' appears in 5 rows, you make 5 calls for India, which wastefully return the same geolocation 5 times.
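As a quick sanity check (a sketch, assuming the attractions dataframe and User_loc column from the question), you can measure how much duplication there is and hence how many calls are wasted:
import pandas as pd

non_null = attractions['User_loc'].dropna()
dup_ratio = 1 - non_null.nunique() / len(non_null)
print(f"{dup_ratio:.0%} of non-null locations are duplicates")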
Optimising this requires a mixture of caching the most frequent locations locally and memoising new lookups as they happen. Before running the job, call Nominatim() on the most frequent locations and save the results as a lookup dict/JSON, e.g. location_geo = df.set_index('location').to_dict()['geolocation']
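A rough sketch of that precomputation step, assuming the attractions dataframe and User_loc column from the question (the top-500 cutoff, the user_agent string and the output file name are arbitrary choices):
import json
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="country-lookup")  # a custom user_agent is recommended by Nominatim's usage policy

# geocode the most frequent non-null locations once
top_locations = attractions['User_loc'].dropna().value_counts().head(500).index

location_geolocation = {}
for loc in top_locations:
    try:
        result = geolocator.geocode(str(loc))
        if result is not None:
            # store the raw response so the cache can be serialised to JSON
            location_geolocation[loc] = result.raw
    except Exception:
        continue

with open('our_save_freq_location.json', 'w') as f:
    json.dump(location_geolocation, f)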
In the end you would have something like this:
import json
from functools import lru_cache

import numpy as np
import pycountry
from geopy.geocoders import Nominatim

geolocator = Nominatim()

# load the most frequent locations precomputed above
with open('our_save_freq_location.json', 'r') as f:
    location_geolocation = json.load(f)

@lru_cache(maxsize=None)
def do_fuzzy_search(location):
    if type(location) == float and np.isnan(location):
        return np.nan
    else:
        try:
            result = pycountry.countries.search_fuzzy(location)
        except Exception:
            try:
                # look first in our dictionary, only call Nominatim if it's not there
                raw = location_geolocation.get(location)
                if raw is None:
                    raw = geolocator.geocode(str(location)).raw
                return raw['display_name'].split(', ')[-1]
            except Exception:
                return np.nan
        else:
            return result[0].name
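With the cache in place, applying the function over the full column only hits the API for locations that are neither in the JSON lookup nor already seen by lru_cache. A minimal usage sketch with the asker's column names:
import pandas as pd

attractions['country'] = attractions['User_loc'].apply(
    lambda loc: do_fuzzy_search(loc) if pd.notnull(loc) else loc
)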
Upvotes: 1
Reputation: 14847
In most cases, an .apply() is slow because it calls some trivially parallelizable function once per row of a dataframe, but in your case you're calling an external API. As such, network access and API rate limiting are likely to be the primary factors determining runtime. Unfortunately, that means there's not an awful lot you can do other than wait.
You might be able to benefit from decorating do_fuzzy_search with functools.lru_cache if some locations are frequently repeated, since that allows the function to avoid the API call whenever the location is found in the cache.
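A minimal sketch of that idea, wrapping the question's do_fuzzy_search rather than redefining it (cached_fuzzy_search is a hypothetical helper name; df and user_location are the names from the question):
import pandas as pd
from functools import lru_cache

@lru_cache(maxsize=None)  # unbounded cache: each distinct location triggers at most one API call
def cached_fuzzy_search(location):
    return do_fuzzy_search(location)

df['country'] = df.user_location.apply(
    lambda loc: cached_fuzzy_search(loc) if pd.notnull(loc) else loc
)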
Upvotes: 2