Reputation: 2012
I want to plot the values of mtx.todense()[1:2,:] to outimg with x, y = cspdata2d[index], where index comes from the 2nd dimension of mtx.
It is like scanpy.pl.spatial, but I want to plot to pixels, and scanpy is slow on huge data sets.
Upvotes: 2
Views: 64
Reputation: 50826
Here is a relatively efficient NumPy solution:
import numpy as np
from scipy.sparse import coo_matrix  # scipy.sparse.coo is a private module; import from scipy.sparse
# Compute `mtx.toarray()[1:2,:]`
# Time: 0.76 s
selected_row = 1
select = mtx.row == selected_row
assert np.array_equiv(mtx.row[select], selected_row)
slice_data = mtx.data[select]
slice_rows = np.zeros(np.count_nonzero(select), dtype=np.int32)  # every kept entry lands in row 0
slice_cols = mtx.col[select]
mtx_sparse_slice = coo_matrix((slice_data, (slice_rows, slice_cols)), shape=(1,mtx.shape[1]))
mtx_dense_slice = mtx_sparse_slice.toarray()
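As a sanity check, the masked-COO slice above should match the naive dense row slice. A minimal sketch on a small random matrix (sizes and seed are arbitrary, chosen only for illustration):

```python
import numpy as np
from scipy.sparse import coo_matrix

# Small random COO matrix standing in for mtx (names mirror the answer's code)
rng = np.random.default_rng(0)
small = coo_matrix(
    (rng.integers(1, 10, size=50),
     (rng.integers(0, 8, size=50), rng.integers(0, 20, size=50))),
    shape=(8, 20),
)

selected_row = 1
select = small.row == selected_row
sliced = coo_matrix(
    (small.data[select],
     (np.zeros(np.count_nonzero(select), dtype=np.int32), small.col[select])),
    shape=(1, small.shape[1]),
).toarray()

# Must match the naive dense slice (which is too memory-hungry at full scale)
assert np.array_equal(sliced, small.toarray()[1:2, :])
```

Both paths sum duplicate (row, col) entries the same way when densified, so the equality holds even if the random construction produces duplicates.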
# Compute `x,y = cspdata2d[index]` where index is from the 2nd dimension of mtx
# Assume cspdata2d contains [x,y] values in each row and not [y,x] ones
# Assume the zeros values are not interesting so they are not included
# Time: 0.42 s
index = mtx_dense_slice[0,:]
nnz_index = np.nonzero(index)[0]
x = cspdata2d[nnz_index,0]
y = cspdata2d[nnz_index,1]
# Compute outimg
# Time: 0.75 s
outimg = np.zeros(shape=cspdata2d.max(axis=0) + 1, dtype=np.int32)  # +1: coordinates are 0-based indices
outimg[x, y] = nnz_index.astype(np.int32)  # int32, since column indices can exceed the int16 range
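A minimal sketch of this scatter step on hypothetical small inputs (all values below are made up for illustration):

```python
import numpy as np

# Hypothetical inputs: nonzero column indices and one (x, y) pixel per column
nnz_index = np.array([2, 7, 11, 3, 9])
cspdata2d = np.array([[0, 0], [1, 3], [2, 1], [4, 4], [3, 2],
                      [0, 4], [1, 1], [2, 2], [3, 3], [4, 0],
                      [0, 2], [1, 0]], dtype=np.int32)

x = cspdata2d[nnz_index, 0]
y = cspdata2d[nnz_index, 1]

# +1 so the largest coordinate is still a valid index
outimg = np.zeros(shape=cspdata2d.max(axis=0) + 1, dtype=np.int32)
outimg[x, y] = nnz_index
```

Note that if two columns map to the same pixel, the one scattered last wins; NumPy fancy-index assignment does not accumulate.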
Here is a setup to test this with random values:
mtx_nnz = 357_558_594
mtx_nrows = 30_562
mtx_ncols = 96_629_557
mtx_data = np.random.randint(0, mtx_ncols, size=mtx_nnz).astype(np.int64) # Apparently values <= mtx_ncols
mtx_rows = np.random.randint(0, mtx_nrows, size=mtx_nnz).astype(np.int32)
mtx_cols = np.random.randint(0, mtx_ncols, size=mtx_nnz).astype(np.int32)
mtx = coo_matrix((mtx_data, (mtx_rows, mtx_cols)), shape=(mtx_nrows, mtx_ncols))
del mtx_data, mtx_rows, mtx_cols # Save some memory space (mtx_ncols is still needed below)
cspdata2d = np.empty(shape=(mtx_ncols,2), dtype=np.int32)
cspdata2d[:,0] = np.random.randint(0, 51_968, size=mtx_ncols)
cspdata2d[:,1] = np.random.randint(0, 40_141, size=mtx_ncols)
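A scaled-down, seeded variant of the same setup can validate the whole pipeline in milliseconds before the full-size run (the sizes and coordinate ranges below are arbitrary stand-ins):

```python
import numpy as np
from scipy.sparse import coo_matrix

rng = np.random.default_rng(42)
mtx_nnz, mtx_nrows, mtx_ncols = 10_000, 64, 5_000

mtx = coo_matrix(
    (rng.integers(0, mtx_ncols, size=mtx_nnz).astype(np.int64),
     (rng.integers(0, mtx_nrows, size=mtx_nnz).astype(np.int32),
      rng.integers(0, mtx_ncols, size=mtx_nnz).astype(np.int32))),
    shape=(mtx_nrows, mtx_ncols),
)
cspdata2d = np.empty(shape=(mtx_ncols, 2), dtype=np.int32)
cspdata2d[:, 0] = rng.integers(0, 512, size=mtx_ncols)
cspdata2d[:, 1] = rng.integers(0, 400, size=mtx_ncols)

# Same three steps as above, at toy scale
select = mtx.row == 1
slice_dense = coo_matrix(
    (mtx.data[select],
     (np.zeros(np.count_nonzero(select), dtype=np.int32), mtx.col[select])),
    shape=(1, mtx.shape[1]),
).toarray()
nnz_index = np.nonzero(slice_dense[0, :])[0]
x, y = cspdata2d[nnz_index, 0], cspdata2d[nnz_index, 1]
outimg = np.zeros(shape=cspdata2d.max(axis=0) + 1, dtype=np.int32)
outimg[x, y] = nnz_index
```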
The overall execution time is about 2 seconds on my machine, which is relatively fast (though not optimal) for sequential NumPy code, considering the significant size of the input data (>=10 GiB to read/write).
Upvotes: 0