Pairwise Distance with Large NumPy Arrays (Chunking?)

Question

Problem: I have a vector that is approximately [350000, 1] and I wish to calculate the pair wise distance. This results in a [350000, 350000] matrix of integer datatype that does not fit into RAM. I eventually want to end up with a boolean (which fits into RAM) so I am currently doing this one element at a time but this is not very time efficient.

Edit: Standard sklearn and scipy functions do not work because of the size of the data -- but if I can chunk it somehow to use hard disk then I should be able to use these.

Problem Visualised: [a_1, a_2, a_3]^t -> [[a_1 - a_1, a_1 - a_2, a_1 - a_3], [a_2 - a_1, a_2 - a_2, a_2 - a_3], [a_3 - a_1, a_3 - a_2, a_3 - a_3]]

Note that only the upper triangle needs to be calculated as it is symmetric when taking the abs value.

Vectorised Code that Needs Chunking or Alternative Solution: I have found a way to compute the distance (subtraction) between all points that works on a small matrix using broadcasting but need a way to be able to do this on larger matrices without hitting RAM limitations.

Or maybe a better way to the MWE below that is quicker could be suggested?

distMatrix = np.absolute((points[np.newaxis, :, :] - points[:, np.newaxis, :])[:, :, 0])

Other Attempts: I have tried using dask and memmap but still get memory errors so must be doing something wrong. I have also tried memmap and manually chunking the data but do not obtain a full set of results so any help would be most appreciated.

MWE of Current Method:


## Data ##
#Note that the datatype and code may not match up exactly as just creating to demonstrate. Essentially want to take first column and create distance matrix with itself through subtracting, and then take 2nd and 3rd column and create euclidean distance matrix.

data = np.random.randint(1, 5, size=[350001,3])
minTime = 3
maxTime = 4
minDist = 1
maxDist = 2

### CODE ###
n = len(data)

for i in trange(n):
    for j in range(i+1, n):
        #Within time threshold?
        if minTime <= (data[j][idxT] - data[i][idxT]) <= maxTime:
            #Within distance threshold?
            xD = math.pow(data[j][idxX] - data[i][idxX], 2)
            yD = math.pow(data[j][idxY] - data[i][idxY], 2)
            d = math.sqrt(xD + yD)
            #If within  threshold then
            if minDist <= d <= maxDist:
                #DO SOMETHING

Reason: I have time, x_coordinate, y_coordinate vectors for approx 350000 points. I want to calculate the distance between all time points (simple subtraction) and the Euclidean distance between each (x,y) point. I then want to be able to identify all point pairs that are within a time and distance occurrence threshold of each other producing a boolean.

V. Ayrat · Accepted Answer

You can split you array to smaller sized ones and calculate the distances for each pair separately.

splits = np.array_split(data, 10)
for i in range(len(splits)):
    for j in range(i, len(splits)):
        m = scipy.spatial.distance.cdist(splits[i], splits[j])
        # do something with m

as the most calculations occur in scipy overhead of python loops will be minimal.

If you boolean array fit into memory and you try to find values which in certain range you can do

import numpy as np
import scipy.spatial.distance


boolean = np.zeros((350, 350), dtype=np.bool_)
a = np.random.randn(350, 2)
splits = np.array_split(a, 10)
shift = splits[0].shape[0]
minDist = -0.5
maxDist = +0.5
for i in range(len(splits)):
    for j in range(i, len(splits)):
        m = scipy.spatial.distance.cdist(splits[i], splits[j])
        masked = (minDist <= m) & (m <= maxDist)
        boolean[i * shift: (i + 1) * shift, j * shift : (j + 1) * shift] = masked
        boolean[j * shift : (j + 1) * shift, i * shift: (i + 1) * shift] = masked.T

Pairwise Distance with Large NumPy Arrays (Chunking?)

Answers (1)

Related Questions