Lyss C
Lyss C

Reputation: 39

Absolute difference between variables in columns of different lengths

EDIT::::: So I need to employ this for x and y coordinates. So DF1 has 5384167 pairs and the DF2 has 1928 pairs.

I understand how @Divakar 's solution works for one variable. In my specific case I have Subway Coordinates and I'd like to compare their proximity to Crime Coordinates. See if there is a correlation with type of crime and euclidean distance from the closest Subway Station.

I know that sqrt(DF1[x]-DF2[x])**2 + (DF1[y]-DF2[y])**2) needs to be done between all X,Y's in subway locations and each X, Y in crime locations.

Crime Coordinates:

X_COORD_CD Y_COORD_CD 0 1007314.0 241257.0 1 1043991.0 193406.0 2 999463.0 231690.0 3 1060183.0 177862.0 4 987606.0 208148.0

Subway Coordinates:

X_COORD_CD Y_COORD_CD 0 1020671.0 248680.0 1 1019420.0 245867.0 2 1017558.0 245632.0

X_Crime Y_Crime X_Subway Y_Subway Dist_sub1 Dist_sub2 Dist_sub3 1007314 241257 1020671 248680 13356.72213 12105.8096 10243.78646 1043991 193406 1019420 245867 23318.81485 24569.93244 26432.01209 999463 231690 1017558 245632 21207.59944 19956.64481 18094.61475 1060183 177862 39511.10383 40762.16584 42624.20504 987606 208148 33064.38708 31813.40719 29951.37426

so Dist_sub1[0] = ((X_Crime[0] - X_Subway[0])**2 + (Y_Crime[0] - Y_Subway[0])**2)**0.5

Dist_sub2[3] = ((X_Crime[3] - X_Subway[1])**2 + (Y_Crime[3] - Y_Subway[1])**2)**0.5

So the minimum distance to a subway for Crime[0] is 10243.78646 (Subway[2]); Crime[1] is 23318.81485 (Subway[0]); Crime[2] is 18094.61475 (Subway[2]); Crime[3] is 39511.10383 (Subway[0]); Crime[4] is 29951.37426 (Subway[2])

END OF EDIT

I have a list of x and y coordinates in two separate dataframes. One has 5384167 entries and the other has 1928 entries.

I'm trying to figure out how to calculate the absolute distance between each entry in df1 and df2, then find the minimum. Example:

df1 :

x
1
2
3

df2 :

x
4
5
6
7

I know creating a matrix is the way to go but I literally do not know how to get there. The matrix would look like this:

3   2   1
4   3   2
5   4   3
6   5   4

Then I'd get the minimum for each column :

3   2   1 

Upvotes: 1

Views: 860

Answers (1)

Divakar
Divakar

Reputation: 221714

Two approaches could be suggested.

Approach #1

One approach with NumPy broacasting with heavy memory requirements would be -

def abs_min_broadcasting(a, b): # a, b are input arrays
    return np.abs(a[:,None]-b).min(axis=0)

output = abs_min_broadcasting(df2.x.values, df1.x.values)

Approach #2

Another memory efficient one and as such faster one would be with np.searchsorted -

def abs_min_searchsorted(a, b): # a, b are input arrays
    a_s = np.sort(a) # skip this if already sorted
    lidx = np.searchsorted(a_s,b).clip(max=len(a_s)-1)
    ridx = (lidx-1).clip(min=0)
    return np.minimum( np.abs(a_s[lidx] - b), np.abs(a_s[ridx] - b) )

output = abs_min_searchsorted(df2.x.values, df1.x.values)

Sample run -

In [866]: df1
Out[866]: 
   x
0  1
1  2
2  3

In [867]: df2
Out[867]: 
   x
0  4
1  5
2  6
3  7

In [869]: abs_min_broadcasting(df2.x.values, df1.x.values)
Out[869]: array([3, 2, 1])

In [870]: abs_min_searchsorted(df2.x.values, df1.x.values)
Out[870]: array([3, 2, 1])

Runtime test -

In [872]: df1 = pd.DataFrame({'x':np.random.randint(0,100,(10000))})

In [873]: df2 = pd.DataFrame({'x':np.random.randint(0,100,(1000))})

In [874]: %timeit abs_min_broadcasting(df2.x.values, df1.x.values)
10 loops, best of 3: 28.4 ms per loop

In [875]: %timeit abs_min_searchsorted(df2.x.values, df1.x.values)
1000 loops, best of 3: 663 µs per loop

In [876]: out1 = abs_min_searchsorted(df2.x.values, df1.x.values)
     ...: out2 = abs_min_broadcasting(df2.x.values, df1.x.values)
     ...: print np.allclose(out1, out2)
     ...: 
True

Upvotes: 1

Related Questions