Reputation: 773
I have this python code to calculate coordinates distances among different points.
IDs,X,Y,Z
0-20,193.722,175.733,0.0998975
0-21,192.895,176.727,0.0998975
7-22,187.065,178.285,0.0998975
0-23,192.296,178.648,0.0998975
7-24,189.421,179.012,0.0998975
8-25,179.755,179.347,0.0998975
8-26,180.436,179.288,0.0998975
7-27,186.453,179.2,0.0998975
8-28,178.899,180.92,0.0998975
The code works perfectly, but as the amount of coordinates I now have is very big (~50000) I need to optimise this code, otherwise is impossible to run. Could someone suggest me a way of doing this that is more memory efficient? Thanks for any suggestion.
#!/usr/bin/env python
import pandas as pd
import scipy.spatial as spsp
df_1 =pd.read_csv('Spots.csv', sep=',')
coords = df_1[['X', 'Y', 'Z']].to_numpy()
distances = spsp.distance_matrix(coords, coords)
df_1['dist'] = distances.tolist()
# CREATES columns d0, d1, d2, d3
dist_cols = df_1['IDs']
df_1[dist_cols] = df_1['dist'].apply(pd.Series)
df_1.to_csv("results_Spots.csv")
Upvotes: 2
Views: 1999
Reputation: 887
You are asking in your code for point to point distances in a ~50000 x ~50000 matrix. The result will be very big, if you really like to store it. The matrix is dense as each point has a positive distance to each other point. I recommend to revisit your business requirements. Do you really need to calculate all these points upfront and store them in a file on a disk ? Sometimes it is better to do the required calculations on the fly; scipy.spacial is fast, perhaps even not much slower then reading a precalculated value.
EDIT (based on comment): You can filter calculated distances by a threshold (here for illustration: 5.0) and then look up the IDs in the DataFrame
import pandas as pd
import scipy.spatial as spsp
df_1 =pd.read_csv('Spots.csv', sep=',')
coords = df_1[['X', 'Y', 'Z']].to_numpy()
distances = spsp.distance_matrix(coords, coords)
adj_5 = np.argwhere(distances[:] < 5.0)
pd.DataFrame(zip(df_1['IDs'][adj_5[:,0]].values,
df_1['IDs'][adj_5[:,1]].values),
columns=['from', 'to'])
Upvotes: 1
Reputation: 114330
There are a couple of ways to save space. The first is to only store the upper triangle of your matrix and make sure that your indices always reflect that. The second is only to store the values that meet your threshold. This can be done collectively by using sparse matrices, which support most of the operations you will likely need, and will only store the elements you need.
To store half the data, preprocess your indices when you access your matrix. So for your matrix, access index [i, j]
like this:
getitem(A, i, j):
if i > j:
i, j = j, i
return dist[i, j]
scipy.sparse
supports a number of sparse matrix formats: BSR, Coordinate, CSR, CSC, Diagonal, DOK, LIL. According to the usage reference, the easiest way to construct a matrix is using DOK or LIL format. I will show the latter for simplicity, although the former may be more efficient. I will leave it up to the reader to benchmark different options once a basic functioning approach has been shown. Remember to convert to CSR or CSC format when doing matrix math.
We will sacrifice speed for spatial efficiency by constructing one row at a time:
N = coords.shape[0]
threshold = 2
threshold2 = threshold**2 # minor optimization to save on some square roots
distances = scipy.sparse.lil_matrix((N, N))
for i in range(N):
# Compute square distances
d2 = np.sum(np.square((coords[i + 1:, :] - coords[i])), axis=1)
# Threshold
mask = np.flatnonzero(d2 <= threshold2)
# Apply, only compute square root if necessary
distances[i, mask + i + 1] = np.sqrt(d2[mask])
For your toy example, we find that there are only four elements that actually pass threshold, making the storage very efficient:
>>> distances.nnz
4
>>> distances.toarray()
array([[0. , 1.29304486, 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 1.1008038 , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0.68355102, 0. , 1.79082802],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ]])
Using the result from scipy.spatial.distance_matrix
confirms that these numbers are in fact accurate.
If you want to fill the matrix (effectively doubling the storage, which should not be prohibitive), you should probably move away from LIL format before doing so. Simply add the transpose to the original matrix to fill it out.
The approach shown here addresses your storage concerns, but you can improve the efficiency of the entire computation using spatial sorting and other geospatial techniques. For example, you could use scipy.spatial.KDTree
or the similar scipy.spatial.cKDTree
to arrange your dataset and query neighbors within a specific threshold directly and efficiently.
For example, the following would replace the matrix construction shown here with what is likely a more efficient method:
tree = scipy.spatial.KDTree(coords)
distances = tree.sparse_distance_matrix(tree, threshold)
Upvotes: 3