Shubhanshu

Reputation: 1015

How to speed up Pandas apply function to create a new column in the dataframe?

In my pandas dataframe, I have a column which contains user location. I have created a function to identify the country from the location and I want to create a new column with the country name. The function is:

from geopy.geocoders import Nominatim
import numpy as np
import pycountry

# geopy's Nominatim expects a user_agent string identifying your application
geolocator = Nominatim(user_agent="country-lookup")

def do_fuzzy_search(location):
    # missing locations come through as float NaN
    if isinstance(location, float) and np.isnan(location):
        return np.nan
    try:
        # try an offline fuzzy match against pycountry first
        result = pycountry.countries.search_fuzzy(location)
    except Exception:
        try:
            # fall back to a live Nominatim (OpenStreetMap) lookup
            loc = geolocator.geocode(str(location))
            return loc.raw['display_name'].split(', ')[-1]
        except Exception:
            return np.nan
    else:
        return result[0].name

Given any location name, the function returns the name of the country. For example:

do_fuzzy_search("Bombay") returns 'India'.

I simply want to create a new column using the apply function.

df['country'] = df.user_location.apply(lambda row: do_fuzzy_search(row) if (pd.notnull(row)) else row)

But it's taking forever to run. I have tried a few techniques mentioned in other questions posted on Stack Overflow and in blog posts on the same theme, such as Performance of Pandas apply vs np.vectorize, Optimizing Pandas Code for Speed, Speed up pandas using dask or swift, and Speed up pandas using cudf.

The time taken to execute just the first 10 rows of the column with each technique is as follows:

%%time
attractions.User_loc[:10].apply(lambda row: do_fuzzy_search(row) if (pd.notnull(row)) else row)
CPU times: user 27 ms, sys: 1.18 ms, total: 28.2 ms
Wall time: 6.59 s
0    United States of America
1                         NaN
2                   Australia
3                       India
4                         NaN
5                   Australia
6                       India
7                       India
8              United Kingdom
9                   Singapore
Name: User_loc, dtype: object

Using the swifter library:

%%time
attractions.User_loc[:10].swifter.apply(lambda row: do_fuzzy_search(row) if (pd.notnull(row)) else row)
CPU times: user 1.03 s, sys: 17.9 ms, total: 1.04 s
Wall time: 7.94 s
0    United States of America
1                         NaN
2                   Australia
3                       India
4                         NaN
5                   Australia
6                       India
7                       India
8              United Kingdom
9                   Singapore
Name: User_loc, dtype: object

Using np.vectorize:

%%time
np.vectorize(do_fuzzy_search)(attractions['User_loc'][:10])
CPU times: user 34.3 ms, sys: 3.13 ms, total: 37.4 ms
Wall time: 9.05 s
array(['United States of America', 'Italia', 'Australia', 'India',
       'Italia', 'Australia', 'India', 'India', 'United Kingdom',
       'Singapore'], dtype='<U24')

I also used Dask's map_partitions, which did not give much of a performance gain over the apply function.

import dask.dataframe as dd
import multiprocessing

dd.from_pandas(attractions.User_loc, npartitions=4 * multiprocessing.cpu_count()) \
    .map_partitions(lambda part: part.apply(lambda row: do_fuzzy_search(row) if pd.notnull(row) else row)) \
    .compute(scheduler='processes')

The computation time for just 10 rows is more than 5 seconds with each technique, so it takes forever for 100k rows. I also tried cudf, but it crashes my Colab notebook.

What can I do to improve the performance and achieve the result in reasonable time?

Upvotes: 3

Views: 638

Answers (2)

Prayson W. Daniel

Reputation: 15568

This looks like an I/O-bound, not a CPU-bound, problem, so multiprocessing will not help. The major bottleneck is your call to Nominatim(): you make an HTTP request to its API for every non-NaN row. That means if 'India' appears in 5 rows, you make 5 calls that wastefully return the same geolocation.
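
As a minimal sketch of avoiding those duplicate calls (the names df, user_location and do_fuzzy_search are taken from the question), you can geocode each distinct value exactly once and then map the results back onto the column:

unique_locations = df['user_location'].dropna().unique()

# geocode each distinct location exactly once
country_by_location = {loc: do_fuzzy_search(loc) for loc in unique_locations}

# broadcast the results back onto the frame; NaN locations stay NaN
df['country'] = df['user_location'].map(country_by_location)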

Optimising this further requires a mixture of caching the most frequent locations locally and caching any new calls you make at runtime:

  1. Create a DataFrame with the N most frequent locations.
  2. Call Nominatim() on those frequent locations and save the result as a lookup dict/JSON, e.g. location_geo = df.set_index('location').to_dict()['geolocation'].
  3. Save it to disk, e.g. with json.dump(...) (a sketch of these pre-computation steps follows this list).
  4. In your function, check whether the location is in the cached location_geo dictionary and return the value if it is; only on a cache miss make a call to the Nominatim API.
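
A minimal sketch of steps 1-3, using a plain loop over the question's attractions.User_loc rather than the set_index construction; N is an arbitrary choice, and the file name matches the one loaded in the final code below:

import json

# 1. take the N most frequent locations in the column
N = 1000
top_locations = attractions['User_loc'].value_counts().head(N).index

# 2. resolve each one once with the existing do_fuzzy_search helper
location_geolocation = {}
for loc in top_locations:
    country = do_fuzzy_search(loc)
    if isinstance(country, str):  # skip failed lookups (NaN)
        location_geolocation[loc] = country

# 3. persist the lookup so later runs skip these API calls entirely
with open('our_save_freq_location.json', 'w') as f:
    json.dump(location_geolocation, f)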

In the end you would have something like this:

import json
from functools import lru_cache

import numpy as np
import pycountry
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="country-lookup")

# load the most frequent locations cached earlier
with open('our_save_freq_location.json', 'r') as f:
    location_geolocation = json.load(f)

@lru_cache(maxsize=None)
def do_fuzzy_search(location):
    if isinstance(location, float) and np.isnan(location):
        return np.nan
    try:
        result = pycountry.countries.search_fuzzy(location)
    except Exception:
        # look first in our cached dictionary, and only call Nominatim on a miss
        if location in location_geolocation:
            return location_geolocation[location]
        try:
            loc = geolocator.geocode(str(location))
            return loc.raw['display_name'].split(', ')[-1]
        except Exception:
            return np.nan
    else:
        return result[0].name

Upvotes: 1

Randy

Reputation: 14847

In most cases, an .apply() is slow because it's calling some trivially parallelizable function once per row of a dataframe, but in your case, you're calling an external API. As such, network access and API rate limiting are likely to be the primary factors determining runtime. Unfortunately, that means there's not an awful lot you can do other than wait.

You might be able to benefit by decorating do_fuzzy_search with functools.lru_cache if some elements are frequently repeated, since that will allow the function to skip the API call whenever the location is found in the cache.
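
A minimal sketch of that decoration, reusing the question's do_fuzzy_search, df and user_location; maxsize=None is an assumption that keeps every result cached for the lifetime of the process:

from functools import lru_cache

# memoise results so repeated locations skip the network round trip entirely
do_fuzzy_search = lru_cache(maxsize=None)(do_fuzzy_search)

df['country'] = df.user_location.apply(
    lambda row: do_fuzzy_search(row) if pd.notnull(row) else row
)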

Upvotes: 2
