yippingAppa

Reputation: 561

How to calculate percentiles in a scipy sparse matrix

Is there any way to calculate the percentile of a scipy sparse matrix?

I do not want to convert the sparse matrix into a dense matrix due to memory concerns.

Here is a working example of what I want, using dense numpy arrays. I'm currently using a numpy version < 1.22, but I don't mind a solution that requires the latest numpy version.

>>> arr = 100 * np.random.rand(3,5)
>>> arr
array([[ 3.24955563, 76.40300826, 95.47390569, 24.19071006, 26.07447378],
       [60.40003646, 38.50289778, 86.50299598, 27.00110588, 34.91898836],
       [51.75939709, 99.00492787, 63.32860788, 23.91364962, 56.34410086]])

>>> col_q3 = np.percentile(arr, 75, interpolation='midpoint', axis=0)
>>> col_q3
array([56.07971677, 87.70396807, 90.98845084, 25.59590797, 45.63154461])

>>> row_q3 = np.percentile(arr, 75, interpolation='midpoint', axis=1)
>>> row_q3
array([76.40300826, 60.40003646, 63.32860788])

For me, the time it takes to calculate these values is not too important. I'm more concerned with memory usage.

Upvotes: 1

Views: 612

Answers (2)

Hugo

Reputation: 1304

Is this what you need?

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html

from scipy import sparse
from sklearn.preprocessing import QuantileTransformer

# 3x5 random sparse matrix with density 0.4, stored in CSR format
M = sparse.random(3, 5, 0.4, "csr")
qt = QuantileTransformer(n_quantiles=10, random_state=0)
qt.fit_transform(M)

Upvotes: 0

hpaulj

Reputation: 231385

Actually I was hoping for a sparse example, such as:

In [45]: M = sparse.random(3, 5, 0.4, "csr")
In [46]: M
Out[46]: 
<3x5 sparse matrix of type '<class 'numpy.float64'>'
    with 6 stored elements in Compressed Sparse Row format>
In [47]: M.A
Out[47]: 
array([[0.44828545, 0.84567936, 0.        , 0.23534173, 0.        ],
       [0.14978221, 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.32428732, 0.        , 0.33813957]])

In [49]: arr = M.A
In [50]: np.percentile(arr, 75, interpolation="midpoint", axis=0)
....
Out[50]: array([0.29903383, 0.42283968, 0.16214366, 0.11767086, 0.16906979])
In [51]: np.percentile(arr, 75, interpolation="midpoint", axis=1)
....
  np.percentile(arr, 75, interpolation="midpoint", axis=1)
Out[51]: array([0.44828545, 0.        , 0.32428732])

NumPy 1.22 issues a deprecation warning about the use of the interpolation parameter, which was renamed to method.
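For reference, here is a small sketch of the same calls without the warning, assuming NumPy >= 1.22 where the keyword is spelled method. The array is the dense example printed above, typed in by hand.

```python
import numpy as np

# Dense copy of the 3x5 example matrix shown above.
arr = np.array([[0.44828545, 0.84567936, 0.0, 0.23534173, 0.0],
                [0.14978221, 0.0, 0.0, 0.0, 0.0],
                [0.0, 0.0, 0.32428732, 0.0, 0.33813957]])

# NumPy >= 1.22: `method` replaces the deprecated `interpolation` keyword.
col_q3 = np.percentile(arr, 75, method="midpoint", axis=0)
row_q3 = np.percentile(arr, 75, method="midpoint", axis=1)
```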

I assume you already searched the sparse docs for percentile. I think you/we need to dig into the np.percentile code to determine exactly what it is doing, in terms of things like row/column sum, multiply, etc.

Sparse implements things like sum:

In [53]: arr.sum(axis=0)
Out[53]: array([0.59806767, 0.84567936, 0.32428732, 0.23534173, 0.33813957])
In [54]: M.sum(axis=0)
Out[54]: matrix([[0.59806767, 0.84567936, 0.32428732, 0.23534173, 0.33813957]])

The sparse sum is actually done with a matrix multiplication.

In [55]: np.ones(3) * M
Out[55]: array([0.59806767, 0.84567936, 0.32428732, 0.23534173, 0.33813957])

The nonzero values are:

In [56]: M.data
Out[56]: 
array([0.44828545, 0.84567936, 0.23534173, 0.14978221, 0.32428732,
       0.33813957])

Getting them by row (or by column), though, requires an iteration.

In [58]: Ml = M.tolil()
In [59]: Ml.data
Out[59]: 
array([list([0.44828545291437716, 0.8456793619879996, 0.23534172969892375]),
       list([0.14978221447183726]),
       list([0.32428731688363377, 0.33813957327426203])], dtype=object)

Upvotes: 2
