Comparing value in dataframe and calculating another attribute using it

Question

I have a pandas DataFrame that holds a lot of points in the XY plane. The dataframe consists of the points' x and y coordinates. I want to compute every point's distance to all other points using the Pythagorean theorem and count the number of points within a certain distance of that point.

import math
import random
import pandas as pd

def distance(x1, y1, x2, y2):
    return math.sqrt((x1 - x2)**2 + (y1 - y2)**2)

df = pd.DataFrame({'X':[random.randint(1,100) for i in range(100)], 'Y':[random.randint(1,100) for i in range(100)]})

I realise that I can loop over the dataframe, but that is not best practice and it takes too long. Is there a way I can optimize this process?

Ultimately I'd want another column in the dataframe that stores the number of points in the dataframe that are within a certain distance of each point.

EDIT: Another thing I am trying to do is look for arbitrary points (or zones) in the XY plane with the largest number of points within a given radius. In other words, I also want to look at positions in the plane that are not necessarily points in the dataframe but are still within the limits of the plane.

Lukas S · Accepted Answer

If you want your code to run fast with pandas and NumPy, get used to writing functions that look like they only work on single numbers but also accept NumPy arrays or pandas Series as input. For example, to find all points in your df that lie at distance r or less from the point (cx, cy), you could do this:

# cx, cy and r (center and radius) are assumed to be defined beforehand
def close_to_my_point(x,y):
    return (x-cx)**2+(y-cy)**2 <= r**2

close_to_my_point(df["X"],df["Y"])

This gives you a Series of booleans indicating whether the point at each position in the dataframe is close to (cx, cy). Note that when summing True/False values, True behaves like 1 and False like 0, so sum(close_to_my_point(df["X"],df["Y"])) does what you want for one point.
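As a small illustration of the boolean-sum idea, here is a minimal sketch. The values of cx, cy and r and the four-row dataframe are made up for the example and are not from the question.

```python
import pandas as pd

# Hypothetical example values: the original snippet leaves cx, cy and r undefined.
cx, cy, r = 0, 0, 5

# Tiny made-up dataframe so the result is easy to check by hand.
df = pd.DataFrame({"X": [0, 3, 6, -4], "Y": [0, 4, 0, 3]})

def close_to_my_point(x, y):
    # Compare squared distances so no square root is needed.
    return (x - cx) ** 2 + (y - cy) ** 2 <= r ** 2

mask = close_to_my_point(df["X"], df["Y"])  # boolean Series, one entry per row
count = int(mask.sum())                     # True counts as 1, False as 0
print(mask.tolist(), count)
```

Here the points (0, 0), (3, 4) and (-4, 3) lie within distance 5 of the origin (the boundary is included because of <=), while (6, 0) does not.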

For functions that can't be applied to Series by default, there is np.vectorize to change that. Putting it all together, you get something that calculates the number of points within some distance quite quickly:

def disk_equation(cx,cy,r):
    return lambda x,y: (x-cx)**2+(y-cy)**2<= r**2

points_in_distance = lambda x,y: sum(disk_equation(x,y,20)(df["X"],df["Y"]))
df["points_closer_than_20"] = np.vectorize(points_in_distance)(df["X"],df["Y"])
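Note that each count includes the point itself, since its distance to itself is 0. Also, the NumPy documentation describes np.vectorize as a convenience that is essentially a loop, not a performance tool. If that becomes a bottleneck, one possible alternative is to compute all pairwise squared distances at once with broadcasting. The sketch below is an assumption-laden illustration, not part of the original answer: the helper name count_within_radius, the fixed seed and the integer grid of candidate centers are all made up for the example. The same helper can also score arbitrary positions in the plane, as asked in the EDIT.

```python
import numpy as np
import pandas as pd

def count_within_radius(centers, points, r):
    """For each row of `centers`, count the rows of `points` within distance r.

    Both inputs are (n, 2) array-likes of x, y coordinates.
    (Hypothetical helper, named for this example only.)
    """
    centers = np.asarray(centers, dtype=float)
    points = np.asarray(points, dtype=float)
    # diff[i, j] = centers[i] - points[j], shape (n_centers, n_points, 2)
    diff = centers[:, None, :] - points[None, :, :]
    sq_dist = (diff ** 2).sum(axis=2)
    # Compare squared distances, then count the True values per center.
    return (sq_dist <= r ** 2).sum(axis=1)

# Random points as in the question, but seeded so the run is reproducible.
rng = np.random.default_rng(0)
df = pd.DataFrame({"X": rng.integers(1, 101, size=100),
                   "Y": rng.integers(1, 101, size=100)})
xy = df[["X", "Y"]].to_numpy()

# Count for every point in the dataframe (each point counts itself).
df["points_closer_than_20"] = count_within_radius(xy, xy, 20)

# Score arbitrary positions: here, an assumed integer grid over the plane.
gx, gy = np.meshgrid(np.arange(1, 101), np.arange(1, 101))
grid = np.column_stack([gx.ravel(), gy.ravel()])
grid_counts = count_within_radius(grid, xy, 20)
best_center = grid[grid_counts.argmax()]  # grid position with the most neighbours
```

The intermediate array has n_centers × n_points × 2 entries, so this trades memory for speed. For much larger inputs, a spatial index such as scipy.spatial.cKDTree may be a better fit.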
