Reputation: 75
I have a pandas DataFrame that holds a lot of points in the XY plane. The DataFrame consists of the points' x and y coordinates. I want to check every point's distance to all other points using the Pythagorean theorem and count the number of points within a certain distance of that point.
import math
import random
import pandas as pd

def distance(x1, y1, x2, y2):
    return math.sqrt((x1 - x2)**2 + (y1 - y2)**2)

df = pd.DataFrame({'X':[random.randint(1,100) for i in range(100)], 'Y':[random.randint(1,100) for i in range(100)]})
I realise that I can loop over the DataFrame, but that is not best practice and it takes too long. Is there a way I can optimize this process?
Ultimately I'd want another column in the dataframe that stores the number of points in the dataframe that are within a certain distance of each point.
EDIT: Another thing I am trying to do is look for arbitrary points (or zones) in the XY plane with the largest number of points within a given radius. What I basically mean is that I also want to look at positions in the plane that are not necessarily points in the dataframe but are still within the limits of the plane.
Upvotes: 0
Views: 54
Reputation: 3593
If you want your code to run fast using pandas and numpy, you should get used to writing functions that look like they only work with numbers but that also accept numpy arrays or pandas Series. E.g. if you want to find all points in your df that are at distance r or less from the point (cx, cy), you could do it like so:
# cx, cy and r must be defined before calling this
def close_to_my_point(x, y):
    return (x-cx)**2 + (y-cy)**2 <= r**2

close_to_my_point(df["X"], df["Y"])
This gives you a Series of booleans indicating whether the point at each position in the dataframe is close to (cx, cy) or not. Notice that when summing over True/False values, True behaves like 1 and False like 0. So sum(close_to_my_point(df["X"],df["Y"])) does what you want for one point.
For functions that can't be applied to Series by default, there is np.vectorize to change that. Putting all that together, you get something that can calculate the number of points within some distance quite quickly. Note that each point also counts itself, since its distance to itself is 0:
def disk_equation(cx, cy, r):
    return lambda x, y: (x-cx)**2 + (y-cy)**2 <= r**2

points_in_distance = lambda x, y: sum(disk_equation(x, y, 20)(df["X"], df["Y"]))
df["points_closer_than_20"] = np.vectorize(points_in_distance)(df["X"], df["Y"])
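As a possible alternative, np.vectorize is documented as a convenience wrapper that is essentially a for loop, so the same counts can also be computed with plain NumPy broadcasting. The following is only a sketch under that assumption; the helper name count_within_radius and the seeded random data are made up for illustration, and the (n, n, 2) intermediate array means memory grows quadratically with the number of points.

```python
import numpy as np
import pandas as pd

def count_within_radius(df, r):
    """Count, for every row, how many rows of df lie within distance r (self included)."""
    xy = df[["X", "Y"]].to_numpy(dtype=float)
    # diff[i, j] = xy[i] - xy[j], giving an array of shape (n, n, 2)
    diff = xy[:, None, :] - xy[None, :, :]
    sq_dist = (diff ** 2).sum(axis=-1)
    # Compare squared distances so no square root is needed
    return (sq_dist <= r ** 2).sum(axis=1)

# Illustrative data with the same shape as in the question
rng = np.random.default_rng(0)
df = pd.DataFrame({"X": rng.integers(1, 101, size=100),
                   "Y": rng.integers(1, 101, size=100)})
df["points_closer_than_20"] = count_within_radius(df, 20)
```

Subtract 1 from the result if a point should not count itself.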
Upvotes: 1
Reputation: 4347
There are a whole lot of tools for pairwise distance calculations included in SciPy. The simplest one to use would be distance_matrix, which calculates pairwise distances and returns them as a matrix. First you need to convert your dataframe into a properly formatted numpy array:
import random
from scipy.spatial import distance_matrix
import pandas as pd
import numpy as np
df = pd.DataFrame({'X': [random.randint(1,100) for i in range(100)], 'Y': [random.randint(1,100) for i in range(100)]})
foo = np.array([(x,y) for x, y in zip(df.X, df.Y)])
baz = distance_matrix(foo, foo)
Here we're using foo twice since we want all pairwise distances between the points in the array.
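To get from the distance matrix to the counts asked for in the question, one option is to threshold it and sum along each row. The sketch below is an illustration rather than a definitive solution: the helper count_neighbours, the radius of 20, the column name within_20 and the 5-unit grid spacing are all assumptions. Since distance_matrix accepts two different arrays, the same helper can also score positions that are not in the dataframe, as in the question's EDIT.

```python
import numpy as np
import pandas as pd
from scipy.spatial import distance_matrix

def count_neighbours(points, centres, radius):
    """For each centre, count the points within `radius` (Euclidean distance)."""
    # distance_matrix(centres, points) has one row per centre
    return (distance_matrix(centres, points) <= radius).sum(axis=1)

# Illustrative data with the same shape as in the question
rng = np.random.default_rng(0)
df = pd.DataFrame({"X": rng.integers(1, 101, size=100),
                   "Y": rng.integers(1, 101, size=100)})
foo = df[["X", "Y"]].to_numpy()

# Neighbours of each dataframe point; subtract 1 to drop the zero self-distance
df["within_20"] = count_neighbours(foo, foo, 20) - 1

# Candidate centres on a coarse grid covering the plane (spacing is arbitrary)
gx, gy = np.meshgrid(np.arange(0, 101, 5), np.arange(0, 101, 5))
grid = np.column_stack([gx.ravel(), gy.ravel()])
grid_counts = count_neighbours(foo, grid, 20)
best_centre = grid[grid_counts.argmax()]  # grid position with the most points nearby
```

The full matrix needs memory proportional to the number of centres times the number of points, so for much larger inputs a spatial index may be a better fit.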
Upvotes: 1