Unsure how to further optimise (get rid of for loop)

Question

I am working with several datasets. One (geodata, 74 observations) contains Indian district names together with the latitude and longitude of each district centre. The other (rainfall_2009) contains rainfall measurements on a geographic grid, along with each grid cell's latitude and longitude. The aim is to link each grid cell to the nearest district whose centre is no more than 100 km away. The rainfall dataset is big: 350 000 observations. I initially ran two nested loops, but I know this is unPythonic, and it was very inefficient, taking around 2.5 hours. I have managed to get rid of one of the loops, but the code still takes 1.5 hours to run. Is there any way to optimise it further?

import numpy as np
from geopy import distance as dist  # assumed: dist.geodesic(...).km matches geopy.distance
from tqdm import tqdm

# Create empty columns for the district name and the distance to its centre
rainfall_2009['district'] = None  # object column, filled with district names below
rainfall_2009['distance'] = np.nan

# Store each district centre as a (latitude, longitude) tuple for dist.geodesic
geodata['location'] = list(zip(geodata.centroid_latitude, geodata.centroid_longitude))

# Run the loop for each grid cell in the dataset
for i in tqdm(rainfall_2009.index):
    place = (rainfall_2009.at[i, 'latitude'], rainfall_2009.at[i, 'longitude'])  # grid cell's coordinates
    distance = geodata.location.apply(lambda x: dist.geodesic(place, x).km)  # distances from the grid cell to all district centres
    near = distance[distance < 100]
    if near.empty:  # no district centre is close enough, so skip this grid cell
        continue
    # Take the minimum distance to assign the closest district
    rainfall_2009.at[i, 'district'] = geodata.distname_iaa.loc[near.idxmin()]
    rainfall_2009.at[i, 'distance'] = near.min()

jsmart · Accepted Answer

Can you pass pandas columns directly to dist.geodesic()? Calling it through apply() once per row is likely to be slow.

This example might be helpful: see the function gcd_vec() in this blog post: https://tomaugspurger.github.io/modern-4-performance
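As a rough illustration of the vectorised approach, here is a minimal sketch that computes every grid-to-centre distance in one NumPy broadcast. It is not the blog post's gcd_vec(): it uses the haversine formula on a spherical Earth, which differs slightly from the geodesic distance in the question, and the function names haversine_km and nearest_within are made up for this example.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; spherical approximation, not geodesic


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km. Inputs are in degrees and may be broadcastable arrays."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def nearest_within(grid_lat, grid_lon, centre_lat, centre_lon, max_km=100.0):
    """For each grid cell return (position of nearest centre, distance in km).

    Cells with no centre closer than max_km get position -1 and distance NaN.
    """
    grid_lat = np.asarray(grid_lat, dtype=float)
    grid_lon = np.asarray(grid_lon, dtype=float)
    centre_lat = np.asarray(centre_lat, dtype=float)
    centre_lon = np.asarray(centre_lon, dtype=float)

    # Shape (n_grid, n_centres): one row per grid cell, one column per centre
    d = haversine_km(grid_lat[:, None], grid_lon[:, None],
                     centre_lat[None, :], centre_lon[None, :])
    idx = d.argmin(axis=1)
    best = d[np.arange(d.shape[0]), idx]

    too_far = best >= max_km
    return np.where(too_far, -1, idx), np.where(too_far, np.nan, best)


# Tiny made-up example: two grid cells, two centres
idx, best = nearest_within([0.0, 40.0], [0.0, 40.0], [0.0, 0.0], [0.5, 5.0])
```

With the question's data, the inputs would be rainfall_2009.latitude, rainfall_2009.longitude, geodata.centroid_latitude and geodata.centroid_longitude, and the returned positions could be mapped to names through geodata.distname_iaa. A 350 000 by 74 distance matrix holds about 26 million floats (roughly 200 MB per temporary array), so processing the grid in chunks may be necessary if memory is tight.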

Also, can you perform fewer distance calculations? For example, calculate the distance from a grid cell to a district centre only if the two end-points are in the same state or in adjacent states.
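The question does not mention a state column, so the sketch below swaps the state check for a different cheap pre-filter: a latitude/longitude bounding box that is guaranteed to contain the 100 km circle. Only pairs that pass the box test would then need an exact distance. The name candidate_mask, the 5 % safety margin and the 111.2 km-per-degree figure are assumptions for illustration; the box ignores wrap-around at the 180° meridian and is not meant for polar latitudes.

```python
import numpy as np

KM_PER_DEG_LAT = 111.2  # approximate length of one degree of latitude


def candidate_mask(grid_lat, grid_lon, centre_lat, centre_lon, max_km=100.0):
    """Boolean (n_grid, n_centres) mask of pairs that could be within max_km.

    True means the centre lies inside a lat/lon box around the grid cell;
    False means the pair is certainly farther apart than max_km.
    """
    grid_lat = np.asarray(grid_lat, dtype=float)[:, None]
    grid_lon = np.asarray(grid_lon, dtype=float)[:, None]
    centre_lat = np.asarray(centre_lat, dtype=float)[None, :]
    centre_lon = np.asarray(centre_lon, dtype=float)[None, :]

    # Latitude half-width of the box, with a small safety margin
    lat_window = 1.05 * max_km / KM_PER_DEG_LAT

    # Degrees of longitude get shorter away from the equator, so widen the
    # window using the highest latitude the circle can reach (capped at 89°)
    edge_lat = np.clip(np.abs(grid_lat) + lat_window, 0.0, 89.0)
    lon_window = lat_window / np.cos(np.radians(edge_lat))

    return ((np.abs(centre_lat - grid_lat) <= lat_window)
            & (np.abs(centre_lon - grid_lon) <= lon_window))


# Made-up example: one grid cell, one nearby centre and one distant centre
mask = candidate_mask([20.0], [78.0], [20.5, 25.0], [78.5, 78.0])
```

Grid cells whose mask row is all False can be skipped outright, and for the rest only the True columns need dist.geodesic().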

UPDATE: The Numba package may speed this up further. You just import jit and apply it as a decorator. Details here: http://numba.pydata.org/numba-doc/latest/user/jit.html

from numba import jit

@jit
def gcd_vec():
    ...  # same body as before
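To make the decorator idea concrete, here is a minimal sketch of a compiled nearest-centre search. It is an illustration, not the blog post's gcd_vec(): the function name nearest_centre is hypothetical, the distance is haversine on a spherical Earth rather than geodesic, and njit (Numba's nopython shortcut) is used in place of jit. The try/except fallback is only there so the sketch still runs, slowly, where Numba is not installed.

```python
import math

import numpy as np

try:
    from numba import njit
except ImportError:  # Numba missing: fall back to plain, uncompiled Python
    def njit(func):
        return func


@njit
def nearest_centre(grid_lat, grid_lon, centre_lat, centre_lon, max_km):
    """Return (position of nearest centre, distance in km) for each grid cell.

    Cells with no centre closer than max_km get position -1 and distance NaN.
    """
    n = grid_lat.shape[0]
    m = centre_lat.shape[0]
    idx = np.empty(n, dtype=np.int64)
    best = np.empty(n, dtype=np.float64)
    for i in range(n):
        idx[i] = -1
        best[i] = np.nan
        phi1 = math.radians(grid_lat[i])
        lam1 = math.radians(grid_lon[i])
        smallest = max_km  # only distances below the threshold are accepted
        for j in range(m):
            phi2 = math.radians(centre_lat[j])
            lam2 = math.radians(centre_lon[j])
            a = (math.sin((phi2 - phi1) / 2) ** 2
                 + math.cos(phi1) * math.cos(phi2) * math.sin((lam2 - lam1) / 2) ** 2)
            d = 2 * 6371.0 * math.asin(math.sqrt(a))  # spherical approximation
            if d < smallest:
                smallest = d
                idx[i] = j
                best[i] = d
    return idx, best


# Made-up example: two grid cells, two centres
idx, best = nearest_centre(np.array([0.0, 40.0]), np.array([0.0, 40.0]),
                           np.array([0.0, 0.0]), np.array([0.5, 5.0]), 100.0)
```

Because the loops are compiled, this version avoids building the full distance matrix in memory. The inputs must be NumPy arrays, for example rainfall_2009.latitude.to_numpy().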
