Reputation: 72
I am working on several datasets. One dataset (geodata, 74 observations) contains Indian district names and the latitude and longitude of each district centre. The other (rainfall_2009) contains rainfall on a geographic grid, together with each grid cell's latitude and longitude. The aim is to link each grid cell to the closest district whose centre is no more than 100 km away. The rainfall dataset is big: 350,000 observations. I initially tried two nested loops, but I know this is unPythonic, and it was very inefficient, taking around 2.5 h. I have managed to get rid of one of the loops, but the code still takes 1.5 h to run. Is there any way I could optimise it further?
import numpy as np
import pandas as pd
from geopy import distance as dist  # assumed: dist refers to geopy.distance
from tqdm import tqdm

# Create empty columns for the district name and the distance to its centre
rainfall_2009['district'] = None  # object column, so it can hold district names
rainfall_2009['distance'] = np.nan
# Build a (latitude, longitude) tuple for each district centre (used by dist.geodesic)
geodata['location'] = list(zip(geodata.centroid_latitude, geodata.centroid_longitude))
# Loop over each grid cell in the dataset
for i in tqdm(rainfall_2009.index):
    place = (rainfall_2009.latitude.loc[i], rainfall_2009.longitude.loc[i])  # grid cell's coordinates
    distance = geodata.location.apply(lambda x: dist.geodesic(place, x).km)  # distances from this cell to every district centre
    close = distance[distance < 100]
    if close.empty:  # no district centre within 100 km, so skip this cell
        continue
    # Take the minimum distance to assign the closest district.
    # Use .loc on the frame itself; chained .iloc assignment may write to a copy.
    rainfall_2009.loc[i, 'district'] = geodata.distname_iaa.loc[close.idxmin()]
    rainfall_2009.loc[i, 'distance'] = close.min()
Upvotes: 0
Views: 48
Reputation: 3001
Can you pass pandas columns directly to dist.geodesic()? Calling it through apply() may be slow.
This example might be helpful (see the function gcd_vec() in this blog post): https://tomaugspurger.github.io/modern-4-performance
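To make the vectorised idea concrete, here is a minimal sketch, not the code from the blog post. It assumes a spherical Earth (haversine formula), so results can differ slightly from dist.geodesic, which works on an ellipsoid. The names haversine_km and nearest_within are hypothetical helpers I made up for illustration.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; spherical approximation (assumption)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; inputs are in degrees and broadcast like NumPy arrays."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    # Clip guards against tiny floating-point overshoot outside [0, 1]
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


def nearest_within(grid_lat, grid_lon, centre_lat, centre_lon, max_km=100.0):
    """For each grid point, return (position of nearest centre or -1, distance in km or NaN)."""
    grid_lat = np.asarray(grid_lat, dtype=float)
    grid_lon = np.asarray(grid_lon, dtype=float)
    centre_lat = np.asarray(centre_lat, dtype=float)
    centre_lon = np.asarray(centre_lon, dtype=float)
    # Distance matrix of shape (number of grid points, number of centres)
    d = haversine_km(grid_lat[:, None], grid_lon[:, None],
                     centre_lat[None, :], centre_lon[None, :])
    idx = d.argmin(axis=1)
    dmin = d[np.arange(len(idx)), idx]
    too_far = dmin >= max_km  # same cutoff as `distance < 100` in the question
    return np.where(too_far, -1, idx), np.where(too_far, np.nan, dmin)
```

With the question's column names you would call nearest_within(rainfall_2009.latitude, rainfall_2009.longitude, geodata.centroid_latitude, geodata.centroid_longitude) and map the returned positions back to geodata.distname_iaa. A 350,000 by 74 matrix of floats is roughly 200 MB per intermediate array, so you may want to process the grid in chunks if memory is tight.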
Also, can you perform fewer distance calculations? For example, calculate the distance from a grid cell to a district centre only if the two points are in the same state or in adjacent states.
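If you don't have state information for the grid cells, a cheap coordinate-based prefilter might serve the same purpose. This is a different technique from the state check above: it only keeps pairs whose latitudes are close enough that a distance under 100 km is possible at all. A rough sketch, where candidate_pairs and KM_PER_DEG_LAT are names I introduced for illustration:

```python
import numpy as np

# One degree of latitude is never shorter than about 110.57 km,
# so using 110.5 keeps the filter conservative (it never drops a valid pair).
KM_PER_DEG_LAT = 110.5


def candidate_pairs(grid_lat, centre_lat, max_km=100.0):
    """Return (grid_positions, centre_positions) for pairs that pass the latitude-band test."""
    grid_lat = np.asarray(grid_lat, dtype=float)
    centre_lat = np.asarray(centre_lat, dtype=float)
    max_deg = max_km / KM_PER_DEG_LAT
    # Boolean matrix: True where the latitude difference alone does not rule the pair out
    close = np.abs(grid_lat[:, None] - centre_lat[None, :]) <= max_deg
    return np.nonzero(close)
```

You would then run the exact distance calculation only on the returned pairs. How much this saves depends on how spread out the district centres are, so it is worth timing on your data.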
UPDATE: The Numba package may speed this up further. You just import and apply a decorator. Details here: http://numba.pydata.org/numba-doc/latest/user/jit.html
from numba import jit

@jit
def gcd_vec():
    # same as before
    ...
Upvotes: 1