Nikit Parakh

Reputation: 75

Comparing value in dataframe and calculating another attribute using it

I have a pandas DataFrame that holds a lot of points in the XY plane. The dataframe consists of the points' x and y coordinates. I want to compute every point's distance to all other points using the Pythagorean theorem and count the number of points within a certain distance of that point.

import math
import random
import pandas as pd

def distance(x1, y1, x2, y2):
    return math.sqrt((x1 - x2)**2 + (y1 - y2)**2)

df = pd.DataFrame({'X':[random.randint(1,100) for i in range(100)], 'Y':[random.randint(1,100) for i in range(100)]})

I realise that I can loop over the dataframe, but that is not best practice and it takes too long. Is there a way I can optimize this process?

Ultimately I'd want another column in the dataframe that stores the number of points in the dataframe that are within a certain distance of each point.

EDIT: Another thing I am trying to do is look for arbitrary points (or zones) in the XY plane with the largest number of points within a given radius. That is, I also want to look at positions that are not necessarily points in the dataframe but are still within the limits of the plane.

Upvotes: 0

Views: 54

Answers (2)

Lukas S

Reputation: 3593

If you want your code to run fast with pandas and numpy, get used to writing functions that look like they only work with numbers but also accept numpy arrays or pandas Series. E.g. to find all points in your df at distance r or less from the point (cx, cy), you could do this:

def close_to_my_point(x,y):
    return (x-cx)**2+(y-cy)**2 <= r**2

close_to_my_point(df["X"],df["Y"])

This gives you a Series of booleans indicating whether the point at each position in the dataframe is close to (cx, cy). Notice that when summing True/False values, True behaves like 1 and False like 0, so sum(close_to_my_point(df["X"],df["Y"])) does what you want for one point.

For functions that can't be applied to Series by default, there is np.vectorize to change that. Putting all that together, you get something that calculates the number of points within some distance quite quickly (note that each count includes the point itself, since its distance to itself is 0):

def disk_equation(cx,cy,r):
    return lambda x,y: (x-cx)**2+(y-cy)**2<= r**2

points_in_distance = lambda x,y: sum(disk_equation(x,y,20)(df["X"],df["Y"]))
df["points_closer_than_20"] = np.vectorize(points_in_distance)(df["X"],df["Y"])
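As a possible alternative to np.vectorize, which still calls the Python function once per point, the same count can be sketched with NumPy broadcasting. The helper name count_within and the four-row example dataframe are illustrative and not part of the original answer. As in the code above, each count includes the point itself.

```python
import numpy as np
import pandas as pd


def count_within(xs, ys, r):
    """Count, for each point, how many points (itself included) lie within r."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    # Broadcasting a column against a row builds an n-by-n table of differences.
    dx = xs[:, None] - xs[None, :]
    dy = ys[:, None] - ys[None, :]
    # Compare squared distances so no square root is needed.
    within = dx**2 + dy**2 <= r**2
    return within.sum(axis=1)


# Small hand-made example (hypothetical data) so the result is easy to check.
df = pd.DataFrame({"X": [0, 3, 10, 50], "Y": [0, 4, 0, 50]})
df["points_closer_than_20"] = count_within(df["X"], df["Y"], 20)
```

This builds an n-by-n boolean table in memory, so it suits a few thousand points; for much larger data a spatial index would likely be a better fit.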

Upvotes: 1

NotAName

Reputation: 4347

There are a whole lot of tools for pairwise distance calculations included in SciPy, in the scipy.spatial module.

The simplest one to use would be distance_matrix, which calculates pairwise distances and returns them as a matrix. First you need to convert your dataframe into a properly formatted numpy array:

import random

from scipy.spatial import distance_matrix
import pandas as pd
import numpy as np

df = pd.DataFrame({'X':[random.randint(1,100) for i in range(100)], 'Y':[random.randint(1,100) for i in range(100)]})

foo = df[['X', 'Y']].to_numpy()  # shape (n, 2): one row per point
baz = distance_matrix(foo, foo)

Here we're using foo twice since we want the distances between all pairs of points in the array: baz[i, j] is the distance between point i and point j.
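To get from the matrix to the counts the question asks for, one option is to threshold it and sum each row. The sketch below uses a small hand-made dataframe and an assumed radius of 20; the names radius, neighbours, centres, counts and best_centre are illustrative. Since distance_matrix also accepts two different point sets, a coarse grid of candidate centres (assumed here to cover 0 to 100 in steps of 5) can be scored the same way for the EDIT in the question.

```python
import numpy as np
import pandas as pd
from scipy.spatial import distance_matrix

# Hypothetical example data, small enough to verify by hand.
df = pd.DataFrame({"X": [0, 3, 10, 50], "Y": [0, 4, 0, 50]})
radius = 20  # assumed threshold

points = df[["X", "Y"]].to_numpy()
dist = distance_matrix(points, points)

# Row i holds point i's distances to every point; subtract 1 to
# exclude the zero distance from the point to itself.
df["neighbours"] = (dist <= radius).sum(axis=1) - 1

# Candidate centres on a regular grid, not restricted to dataframe points.
gx, gy = np.meshgrid(np.arange(0, 101, 5), np.arange(0, 101, 5))
centres = np.column_stack([gx.ravel(), gy.ravel()])

# One row per candidate centre, one column per data point.
counts = (distance_matrix(centres, points) <= radius).sum(axis=1)
best_centre = centres[counts.argmax()]  # first grid position with the highest count
```

A finer grid gives a better estimate of the densest zone at the cost of a larger matrix, so the step size is a trade-off to tune for the data at hand.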

Upvotes: 1
