kyzerkenso

Reputation: 33

Compare each data point in one dataframe against every data point in another without looping

I'd like to check coordinates (x,y,z) from dataframe-1 (df1) to see if the location is close enough to an irregular surface that has its own coordinates (x,y,z) stored in dataframe-2 (df2).

I can go through each coordinate in df1, loop through every coordinate in df2, and check its distance, then repeat for all coordinates in df1. But this takes far too long when I have over 1,000,000 coordinates in df1 to check.

I'm using pandas and wondering if it can be done without looping.

If a coordinate in df1 is close to any coordinate in df2, I want to select it and store it in df3.

Upvotes: 1

Views: 441

Answers (2)

AlexK

Reputation: 3011

Using NumPy methods:

If your two dataframes look like this:

df1
    coords
0   (4,3,5)
1   (5,4,3)

df2
    coords
0   (6,7,8)
1   (8,7,6)

then:

import numpy as np
import pandas as pd
from itertools import product

#convert dataframes into numpy arrays
df1_arr = np.array([np.array(x) for x in df1.coords.values])
df2_arr = np.array([np.array(x) for x in df2.coords.values])

#create array of cartesian product of elements of the two arrays
cart_arr = np.array([x for x in product(df1_arr,df2_arr)])

#compute Euclidean distance (or norm) between pairs of elements in two arrays
#outputs new array with one value per pair of coordinates
norms_arr = np.linalg.norm(np.diff(cart_arr,axis=1)[:,0,:],axis=1)

#create distance threshold for "close enough"
radius = 5.5

#find values in norms array that are less than or equal to distance threshold
good_idxs = np.argwhere(norms_arr <= radius)[:,0]
good_coord_pairs = cart_arr[good_idxs]

#store corresponding pairs of coordinates and distances in new dataframe
final_df = pd.DataFrame({'df1_coords': list(map(tuple, good_coord_pairs[:,0,:])),
   'df2_coords': list(map(tuple, good_coord_pairs[:,1,:])),
   'distance': norms_arr[good_idxs]},
   index=list(range(len(good_coord_pairs))))

will produce:

final_df
    df1_coords  df2_coords  distance
0   (4,3,5)     (6,7,8)     5.385165
1   (5,4,3)     (8,7,6)     5.196152
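
As a possible variation, the same pairwise distances can be computed with NumPy broadcasting instead of itertools.product. This is only a sketch: it assumes the coordinates are stored in separate x, y, z columns (as the question describes) rather than in a single coords column, and the name df3 is taken from the question.

```python
import numpy as np
import pandas as pd

# Same toy points as above, but in separate x, y, z columns (an assumption)
df1 = pd.DataFrame({"x": [4, 5], "y": [3, 4], "z": [5, 3]})
df2 = pd.DataFrame({"x": [6, 8], "y": [7, 7], "z": [8, 6]})

a = df1[["x", "y", "z"]].to_numpy(dtype=float)  # shape (N, 3)
b = df2[["x", "y", "z"]].to_numpy(dtype=float)  # shape (M, 3)

# Broadcasting: (N, 1, 3) - (1, M, 3) -> (N, M, 3), then norm over the last axis
dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)  # shape (N, M)

radius = 5.5  # same "close enough" threshold as above

# Keep the df1 rows whose nearest df2 point lies within the radius
df3 = df1[dists.min(axis=1) <= radius]
```

Note that the intermediate (N, M, 3) array holds every pair at once, so with 1,000,000 rows in df1 it would likely need to be processed in chunks of df1 rows to fit in memory.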

Upvotes: 1

bubble

Reputation: 1672

SciPy could help you. Look at the following hypothetical example:

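import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

# pd.np was removed in pandas 2.0, so call numpy directly
dataset1 = pd.DataFrame(np.random.rand(100, 3))
dataset2 = pd.DataFrame(np.random.rand(10, 3))

# build a k-d tree on the points of dataset1
ck = cKDTree(dataset1.values)

# for each point in dataset2, list the indices of dataset1 points within r
ck.query_ball_point(dataset2.values, r=0.1)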

array([list([]), list([]), list([]), list([]), list([28, 83]), list([79]), list([]), list([86]), list([40]), list([29, 60, 95])], dtype=object)
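
To get from there to the df3 the question asks for, one option is to build the tree on the surface points (df2) and ask for the nearest-surface distance of every df1 point. A minimal sketch, assuming x, y, z columns, random stand-in data, and an arbitrary radius of 0.1:

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)

# Hypothetical stand-ins: df1 holds the points to test, df2 the surface points
df1 = pd.DataFrame(rng.random((1000, 3)), columns=["x", "y", "z"])
df2 = pd.DataFrame(rng.random((50, 3)), columns=["x", "y", "z"])

radius = 0.1  # assumed "close enough" threshold

# Build the tree once on the surface points
tree = cKDTree(df2[["x", "y", "z"]].to_numpy())

# Distance from every df1 point to its nearest surface point (k=1)
dist, _ = tree.query(df1[["x", "y", "z"]].to_numpy(), k=1)

# Keep only the df1 rows that lie within the radius of the surface
df3 = df1[dist <= radius]
```

This avoids forming all N x M pairs, which matters when df1 has over 1,000,000 rows.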

Upvotes: 1
