Reputation: 561
Is there any way to calculate the percentile of a scipy sparse matrix?
I do not want to convert the sparse matrix into a dense matrix due to memory concerns.
Here is a working example of what I want, using dense numpy arrays. I'm currently using a numpy version < 1.22, but I don't mind a solution that uses the latest numpy version.
>>> arr = 100 * np.random.rand(3,5)
>>> arr
array([[ 3.24955563, 76.40300826, 95.47390569, 24.19071006, 26.07447378],
[60.40003646, 38.50289778, 86.50299598, 27.00110588, 34.91898836],
[51.75939709, 99.00492787, 63.32860788, 23.91364962, 56.34410086]])
>>> col_q3 = np.percentile(arr, 75, interpolation='midpoint', axis=0)
>>> col_q3
array([56.07971677, 87.70396807, 90.98845084, 25.59590797, 45.63154461])
>>> row_q3 = np.percentile(arr, 75, interpolation='midpoint', axis=1)
>>> row_q3
array([76.40300826, 60.40003646, 63.32860788])
For me, the time it takes to calculate these values is not too important. I'm more concerned with memory usage.
Upvotes: 1
Views: 612
Reputation: 1304
Is this what you need?
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html
from scipy import sparse
from sklearn.preprocessing import QuantileTransformer
# 3x5 random sparse matrix in CSR format with 40% density
M = sparse.random(3, 5, density=0.4, format="csr")
qt = QuantileTransformer(n_quantiles=10, random_state=0)
qt.fit_transform(M)
Upvotes: 0
Reputation: 231385
Actually I was hoping for a sparse example, such as:
In [45]: M = sparse.random(3, 5, 0.4, "csr")
In [46]: M
Out[46]:
<3x5 sparse matrix of type '<class 'numpy.float64'>'
with 6 stored elements in Compressed Sparse Row format>
In [47]: M.A
Out[47]:
array([[0.44828545, 0.84567936, 0. , 0.23534173, 0. ],
[0.14978221, 0. , 0. , 0. , 0. ],
[0. , 0. , 0.32428732, 0. , 0.33813957]])
In [49]: arr = M.A
In [50]: np.percentile(arr, 75, interpolation="midpoint", axis=0)
....
Out[50]: array([0.29903383, 0.42283968, 0.16214366, 0.11767086, 0.16906979])
In [51]: np.percentile(arr, 75, interpolation="midpoint", axis=1)
....
Out[51]: array([0.44828545, 0. , 0.32428732])
numpy 1.22 has a deprecation warning about the use of the interpolation parameter.
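As far as I know, the replacement keyword in numpy 1.22 and later is method, with the same option names. A minimal sketch on the dense copy of the matrix above (the values are typed in by hand from Out[47], so they are rounded):

```python
import numpy as np

# Dense copy of the 3x5 example matrix shown above (rounded values)
arr = np.array([[0.44828545, 0.84567936, 0.0, 0.23534173, 0.0],
                [0.14978221, 0.0, 0.0, 0.0, 0.0],
                [0.0, 0.0, 0.32428732, 0.0, 0.33813957]])

# numpy >= 1.22: `method` replaces the deprecated `interpolation` keyword
col_q3 = np.percentile(arr, 75, method="midpoint", axis=0)
row_q3 = np.percentile(arr, 75, method="midpoint", axis=1)
```

On older numpy versions this raises a TypeError, and interpolation="midpoint" is still the keyword to use.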
I assume you already searched the sparse docs for percentile. I think you/we need to dig into the np.percentile code to determine exactly what it is doing, in terms of things like row/column sum, multiply, etc.
Sparse implements things like sum:
In [53]: arr.sum(axis=0)
Out[53]: array([0.59806767, 0.84567936, 0.32428732, 0.23534173, 0.33813957])
In [54]: M.sum(axis=0)
Out[54]: matrix([[0.59806767, 0.84567936, 0.32428732, 0.23534173, 0.33813957]])
The sparse sum is actually done with a matrix multiplication.
In [55]: np.ones(3) * M
Out[55]: array([0.59806767, 0.84567936, 0.32428732, 0.23534173, 0.33813957])
The nonzero values are:
In [56]: M.data
Out[56]:
array([0.44828545, 0.84567936, 0.23534173, 0.14978221, 0.32428732,
0.33813957])
though getting them by row (or by column) requires an iteration.
In [58]: Ml = M.tolil()
In [59]: Ml.data
Out[59]:
array([list([0.44828545291437716, 0.8456793619879996, 0.23534172969892375]),
list([0.14978221447183726]),
list([0.32428731688363377, 0.33813957327426203])], dtype=object)
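Building on that iteration, here is a rough sketch of how a midpoint percentile could be computed without making a dense copy. This is not something scipy provides; the function names (sparse_percentile_midpoint, compressed_percentile_midpoint, _kth_value) are made up for illustration. The idea is to sort only the stored values of one row or column at a time and treat the implicit zeros as a block sitting between the negative and the non-negative stored values. It assumes the matrix has no duplicate entries and no NaNs, and it only mimics the "midpoint" option of np.percentile.

```python
import numpy as np


def _kth_value(sorted_vals, n_implicit_zeros, n_negative, k):
    """Element k of the fully sorted row/column, without materialising it."""
    if k < n_negative:
        return sorted_vals[k]            # still among the negative stored values
    if k < n_negative + n_implicit_zeros:
        return 0.0                       # inside the block of implicit zeros
    return sorted_vals[k - n_implicit_zeros]  # explicit zeros and positives


def compressed_percentile_midpoint(indptr, data, length, q):
    """Midpoint percentile of every row (CSR) or column (CSC).

    indptr, data : the arrays of a CSR or CSC matrix
    length       : number of entries per row/column, implicit zeros included
    """
    pos = q / 100.0 * (length - 1)       # fractional index into the sorted data
    lo, hi = int(np.floor(pos)), int(np.ceil(pos))
    out = np.empty(len(indptr) - 1)
    for i in range(len(indptr) - 1):
        vals = np.sort(data[indptr[i]:indptr[i + 1]])  # stored values only
        n_zero = length - len(vals)
        n_neg = int(np.searchsorted(vals, 0.0, side="left"))
        out[i] = 0.5 * (_kth_value(vals, n_zero, n_neg, lo)
                        + _kth_value(vals, n_zero, n_neg, hi))
    return out


def sparse_percentile_midpoint(M, q, axis=0):
    """Hypothetical wrapper: M is any scipy sparse matrix, axis as in numpy."""
    # axis=0 reduces over rows, so work column by column (CSC), and vice versa
    C = M.tocsc() if axis == 0 else M.tocsr()
    return compressed_percentile_midpoint(C.indptr, C.data, C.shape[axis], q)
```

With the M above, sparse_percentile_midpoint(M, 75, axis=0) and sparse_percentile_midpoint(M, 75, axis=1) should agree with Out[50] and Out[51]. The extra memory is one sorted row or column of stored values at a time, plus a CSC/CSR copy if M is not already in the right format.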
Upvotes: 2