Reputation: 1865
I have a numpy array as follows:
array = np.random.randint(6, size=(50, 400))
This array holds the cluster that each value belongs to, with each row representing a sample and each column representing a feature. I would like to create an array with 5 columns holding the frequency of each cluster within each sample (each sample being a row of this matrix).
However, in the frequency calculation, I want to ignore 0, meaning that the frequencies of all values except 0 (1-5) should add to 1.
Essentially, what I want is an array with each column being a cluster (1-5 in this case), where each row still contains a single sample.
How can this be done?
Edit:
small input:
input = np.random.randint(6, size=(2, 5))
array([[0, 4, 2, 3, 0],
[5, 5, 2, 5, 3]])
output:
1 2 3 4 5
0 .33 .33 .33 0
0 .2 .2 0 .6
Where 1-5 are the column names, and the bottom two rows are the desired output in a numpy array.
Upvotes: 1
Views: 107
Reputation: 19770
This is a simple application of bincount. Does this do what you want?
def freqs(x):
    counts = np.bincount(x, minlength=6)[1:]
    return counts/counts.sum()

frequencies = np.apply_along_axis(freqs, axis=1, arr=array)
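As a quick sanity check, here is a sketch that runs the same function on the small input from the question. The name `small` is only for illustration, and the values are copied from the question rather than drawn at random:

```python
import numpy as np

def freqs(x):
    # Count occurrences of 0..5, then drop the count for 0
    counts = np.bincount(x, minlength=6)[1:]
    # Normalize so the frequencies of clusters 1-5 sum to 1
    return counts / counts.sum()

# Small input from the question (hypothetical variable name)
small = np.array([[0, 4, 2, 3, 0],
                  [5, 5, 2, 5, 3]])

frequencies = np.apply_along_axis(freqs, axis=1, arr=small)
print(frequencies.round(2))
```

Note that a row containing only zeros would make `counts.sum()` equal to 0, so the division would produce NaN for that row.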
If you were wondering about the speed implications of apply_along_axis, this method using tricky indexing is marginally slower in my tests:
values = np.arange(1, 6)  # the cluster labels to count (1-5); 0 is left out
counts = (array[:, :, None] == values[None, None, :]).sum(axis=1)
frequencies2 = counts/counts.sum(axis=1)[:, None]
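A third option, not part of the answer above, is to call bincount once on the whole array by shifting each row into its own block of bins. This is only a sketch: the function name `row_frequencies` and the `n_clusters` parameter are made up for illustration, it assumes every value lies in 0..n_clusters, and it has not been timed against the two methods above:

```python
import numpy as np

def row_frequencies(arr, n_clusters=5):
    # Hypothetical helper: assumes all values are integers in 0..n_clusters
    n_rows = arr.shape[0]
    width = n_clusters + 1  # bins per row, including the ignored 0
    # Shift row i by i * width so each row gets its own block of bins
    offsets = np.arange(n_rows)[:, None] * width
    counts = np.bincount((arr + offsets).ravel(), minlength=n_rows * width)
    # One row of counts per sample, dropping the column for 0
    counts = counts.reshape(n_rows, width)[:, 1:]
    totals = counts.sum(axis=1, keepdims=True)
    # Rows with no nonzero values are left as all zeros instead of NaN
    return np.divide(counts, totals,
                     out=np.zeros(counts.shape, dtype=float),
                     where=totals > 0)

array = np.random.randint(6, size=(50, 400))
frequencies3 = row_frequencies(array)
```

Returning zeros for a sample that contains only 0 is a design choice made here; the question does not say what should happen in that case.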
Upvotes: 4