Reputation: 3392
I have the following code that takes a very long time to execute. The pandas DataFrames df and df_plants are very small (less than 1 MB). I wonder if there is any way to optimise this code:
import pandas as pd
import geopy.distance
import re

def is_inside_radius(latitude, longitude, df_plants, radius):
    if (latitude != None and longitude != None):
        lat = float(re.sub("[a-zA-Z]", "", str(latitude)))
        lon = float(re.sub("[a-zA-Z]", "", str(longitude)))
        for index, row in df_plants.iterrows():
            coords_1 = (lat, lon)
            coords_2 = (row["latitude"], row["longitude"])
            dist = geopy.distance.distance(coords_1, coords_2).km
            if dist <= radius:
                return 1
    return 0

df["inside"] = df.apply(lambda row: is_inside_radius(row["latitude"],row["longitude"],df_plants,10), axis=1)
I use a regex to clean latitude and longitude in df because the values contain some errors (stray characters) which should be deleted.
The function is_inside_radius checks whether the point (row[latitude], row[longitude]) lies within a radius of 10 km of any of the points in df_plants.
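The cleaning step described above could be pulled into a small helper so it can be checked separately from the distance logic. This is only a sketch: clean_coordinate is a hypothetical name, and returning None for values that cannot be parsed (including NaN) is an assumption about the desired behaviour, not something the code above does.

```python
import re


def clean_coordinate(value):
    """Strip ASCII letters from a raw coordinate and return it as a float.

    Returns None when the value is missing or nothing parseable remains
    (assumed behaviour; the original code would raise in that case).
    """
    if value is None:
        return None
    # Same pattern as in the question: delete every ASCII letter
    stripped = re.sub("[a-zA-Z]", "", str(value)).strip()
    try:
        return float(stripped)
    except ValueError:
        return None
```

With such a helper, the latitude and longitude columns of df could be cleaned once up front, and is_inside_radius would then only receive floats or None.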
Upvotes: 4
Views: 3490
Reputation: 4625
Can you try this?
import pandas as pd
from geopy import distance
import re

def is_inside_radius(latitude, longitude, df_plants, radius):
    if latitude is not None and longitude is not None:
        lat = float(re.sub("[a-zA-Z]", "", str(latitude)))
        lon = float(re.sub("[a-zA-Z]", "", str(longitude)))
        coords_1 = (lat, lon)  # built once, outside the loop
        # itertuples() yields namedtuples, so fields are attributes, not keys
        for row in df_plants.itertuples():
            coords_2 = (row.latitude, row.longitude)
            if distance.distance(coords_1, coords_2).km <= radius:
                return 1
    return 0

# Row-wise application needs DataFrame.apply with axis=1
df["inside"] = df.apply(
    lambda row: is_inside_radius(
        row["latitude"],
        row["longitude"],
        df_plants,
        10),
    axis=1)
According to https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iterrows.html#pandas-dataframe-iterrows, pandas.DataFrame.itertuples() returns namedtuples of the values, is generally faster than pandas.DataFrame.iterrows(), and preserves dtypes across the returned rows.
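To make the difference between the two iterators concrete, here is a small sketch. The sample data and the plant_id column are made up for illustration; only the latitude and longitude column names come from the question.

```python
import pandas as pd

# Hypothetical sample data; "plant_id" is an illustrative column
df_plants = pd.DataFrame(
    {"plant_id": [1, 2], "latitude": [48.85, 52.52], "longitude": [2.35, 13.40]}
)

# iterrows() yields (index, Series) pairs. A Series has a single dtype, so the
# integer plant_id is upcast to float64 alongside the float columns.
_, series_row = next(df_plants.iterrows())

# itertuples() yields namedtuples: fields are read as attributes
# (row.latitude, not row["latitude"]) and each value keeps its column's type.
tuple_row = next(df_plants.itertuples(index=False))

print(series_row.dtype)
print(type(tuple_row.plant_id).__name__, tuple_row.latitude)
```

Note that namedtuple rows do not support string indexing, so any code switched from iterrows() to itertuples() has to change row["latitude"] to row.latitude.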
Upvotes: 4
Reputation: 17493
I've encountered such a problem before, and I see one simple optimisation: try to avoid the expensive floating-point distance calculation as much as possible, which you can do as follows:
Imagine:
You have a circle, defined by Mx and My (center coordinates) and R (radius).
You have a point, defined by its coordinates X and Y.
If your point (X,Y) is not even within the square centered on (Mx, My) with side length 2*R, then it cannot be within the circle defined by (Mx, My) and radius R either.
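In Python, assuming planar coordinates in the same units as R:
def is_inside(x, y, mx, my, r):
    # Cheap rejection: a point outside the bounding square cannot be inside the circle
    if abs(mx - x) > r or abs(my - y) > r:
        return False
    # Only here is the more expensive distance calculation performed
    return (mx - x) ** 2 + (my - y) ** 2 <= r ** 2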
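For latitude/longitude data the square has to be expressed in degrees rather than kilometres. Below is a rough sketch of how that might look, assuming a spherical Earth. haversine_km is a stand-in for geopy.distance so that the example has no dependencies, and plants is assumed to be a plain list of (latitude, longitude) pairs; none of these names come from the question.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (spherical approximation)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km on a sphere (stand-in for geopy.distance)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_inside_radius(lat, lon, plants, radius_km):
    """Return 1 if (lat, lon) is within radius_km of any plant, else 0."""
    ang = radius_km / EARTH_RADIUS_KM      # angular radius in radians
    lat_delta = math.degrees(ang)          # half-height of the box in degrees
    if abs(lat) + lat_delta < 90:
        # Half-width of the box in degrees of longitude at this latitude
        lon_delta = math.degrees(
            math.asin(math.sin(ang) / math.cos(math.radians(lat))))
    else:
        lon_delta = 180.0                  # circle reaches a pole: no longitude cut
    for plant_lat, plant_lon in plants:
        # Cheap rejection on latitude
        if abs(plant_lat - lat) > lat_delta:
            continue
        # Cheap rejection on longitude, allowing for wrap-around at +/-180
        dlon = abs(plant_lon - lon) % 360
        if min(dlon, 360 - dlon) > lon_delta:
            continue
        # Only the remaining candidates get the exact distance check
        if haversine_km(lat, lon, plant_lat, plant_lon) <= radius_km:
            return 1
    return 0
```

If the exact check is done with geopy's geodesic distance instead of the spherical formula, the two can disagree by a fraction of a percent, so it is probably safer to widen the box slightly (for example by computing it for a radius a few percent larger) before rejecting points.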
Upvotes: 2