Reputation: 2656
So my NumPy array looks like this
npfinal =
[[1, 3, 5, 0, 0, 0],
[5, 2, 4, 0, 0, 0],
[7, 7, 2, 0, 0, 0],
.
.
.
The sample dataset I'm working with has 25k rows.
The first 3 columns contain meaningful data; the rest are placeholders for the percentiles.
So I need the percentile of npfinal[0][0] with respect to the entire first column, stored in npfinal[0][3]. That is, the percentile score of 1 with respect to the column [1, 5, 7, ...].
My first attempt was:
import scipy.stats as ss
...
numofcols = 3
for row in npfinal:
    for i in range(0, numofcols):
        row[i + numofcols] = int(round(ss.percentileofscore(npfinal[:, i], row[i])))
But this takes far too long, and on the full dataset it will be infeasible.
I'm new to computing on such large datasets, so any sort of help will be appreciated.
Upvotes: 1
Views: 1715
Reputation: 2416
I found a solution that I believe works better when there are repeated values in the array:
import numpy as np
from scipy import stats
# some array with repeated values:
M = np.array([[1, 7, 2], [5, 2, 2], [5, 7, 2]])
# calculate percentiles applying scipy rankdata to each column:
percentile = np.apply_along_axis(stats.rankdata, 0, M, method='average') / len(M)
The np.argsort solution has the problem that it assigns different percentiles to repetitions of the same value. For example, given:
percentile_argsort = np.argsort(np.argsort(M, axis=0), axis=0) / float(len(M)) * 100
percentile_rankdata = np.apply_along_axis(stats.rankdata, 0, M, method='average') / len(M)
the two approaches will output the following results (note that the argsort version is scaled to 0-100 and the rankdata version to 0-1):
M
array([[1, 7, 2],
[5, 2, 2],
[5, 7, 2]])
percentile_argsort
array([[ 0. , 33.33333333, 0. ],
[ 33.33333333, 0. , 33.33333333],
[ 66.66666667, 66.66666667, 66.66666667]])
percentile_rankdata
array([[ 0.33333333, 0.83333333, 0.66666667],
[ 0.83333333, 0.33333333, 0.66666667],
[ 0.83333333, 0.83333333, 0.66666667]])
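To connect this back to the layout in the question, here is a minimal sketch that writes the rounded percentiles into the placeholder columns. It assumes the array is an integer array with three data columns followed by three placeholder columns, and the 3-row npfinal below is only a small stand-in for the real 25k-row data:

```python
import numpy as np
from scipy import stats

# Small stand-in for the question's array (assumption: the real one has
# the same layout, three data columns then three placeholder columns).
npfinal = np.array([[1, 3, 5, 0, 0, 0],
                    [5, 2, 4, 0, 0, 0],
                    [7, 7, 2, 0, 0, 0]])
numofcols = 3

data = npfinal[:, :numofcols]

# Average rank of every value within its own column (ties share a rank).
ranks = np.apply_along_axis(stats.rankdata, 0, data, method='average')

# Scale the ranks to 0-100 and round to whole percentiles.
pct = np.rint(ranks / len(data) * 100).astype(int)

# Fill all placeholder columns in one assignment instead of a Python loop.
npfinal[:, numofcols:] = pct
print(npfinal)
```

For values that are present in the column, this should agree with what percentileofscore returns with its default settings, but it ranks each column once instead of rescanning it for every row.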
Upvotes: 2
Reputation: 17942
You might be able to compute the percentile by sorting the array and dividing the resulting index by the total number of rows (assuming NumPy is available):
import numpy as np
M = np.array([[1, 3, 5], [5, 2, 4], [7, 7, 2]])
percentile = np.argsort(np.argsort(M, axis=0), axis=0) / float(len(M)) * 100
print("M:", M, sep="\n")
print("percentile:", percentile, sep="\n")
Output:
M:
[[1 3 5]
[5 2 4]
[7 7 2]]
percentile:
[[ 0. 33.33333333 66.66666667]
[ 33.33333333 0. 33.33333333]
[ 66.66666667 66.66666667 0. ]]
Now you only need to concatenate the result and your original array.
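One way to do that concatenation is np.hstack. This is only a sketch; the name combined is made up for illustration, and note that the result is a float array because the percentiles are floats:

```python
import numpy as np

M = np.array([[1, 3, 5], [5, 2, 4], [7, 7, 2]])
percentile = np.argsort(np.argsort(M, axis=0), axis=0) / float(len(M)) * 100

# Put the percentile columns to the right of the original columns.
# The integer data is upcast to float to match the percentile values.
combined = np.hstack([M, percentile])
print(combined)
```

If the percentiles should be stored as whole numbers, as in the question, they can be rounded and cast with np.rint(percentile).astype(int) before stacking.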
Upvotes: 1