cf2

Reputation: 591

Calculate percentage of count for a list of arrays

Simple problem, but I cannot seem to get it to work. I want to calculate how often each number occurs in each array of a list, expressed as a percentage, and output these percentages. I have a list of arrays which looks like this:

import numpy as np

# Create some data   
listvalues = []

arr1 = np.array([0, 0, 2])
arr2 = np.array([1, 1, 2, 2])
arr3 = np.array([0, 2, 2])

listvalues.append(arr1)
listvalues.append(arr2)
listvalues.append(arr3)

listvalues
>[array([0, 0, 2]), array([1, 1, 2, 2]), array([0, 2, 2])]

Now I count the occurrences using collections, which returns a list of collections.Counter objects:

import collections 

counter = []
# Iterate over the arrays directly (avoids the Python 2-only xrange)
for arr in listvalues:
    counter.append(collections.Counter(arr))

counter
>[Counter({0: 2, 2: 1}), Counter({1: 2, 2: 2}), Counter({0: 1, 2: 2})]

The result I am looking for is an array with 3 columns, one for each of the values 0 to 2, and len(listvalues) rows. Each cell should hold the percentage of that row's array taken up by that value:

# Result
66.66    0      33.33
0        50     50
33.33    0      66.66

So 0 makes up 66.66% of array 1, 0% of array 2 and 33.33% of array 3, and so on.

What would be the best way to achieve this? Many thanks!

Upvotes: 5

Views: 12772

Answers (5)

Divakar

Reputation: 221574

Here's an approach -

# Get lengths of each element in input list
lens = np.array([len(item) for item in listvalues])

# Form group ID array to ID elements in flattened listvalues
ID_arr = np.repeat(np.arange(len(lens)),lens)

# Flatten all values, then count each (row, value) pair via a single flat index.
# minlength keeps the reshape valid even if the last row lacks the largest value.
vals = np.concatenate(listvalues)
out_shp = [ID_arr.max()+1,vals.max()+1]
counts = np.bincount(ID_arr*out_shp[1] + vals, minlength=out_shp[0]*out_shp[1])

# Finally, get the percentages by dividing by the group lengths
out = 100*np.true_divide(counts.reshape(out_shp),lens[:,None])

Sample run with an additional fourth array in input list -

In [316]: listvalues
Out[316]: [array([0, 0, 2]),array([1, 1, 2, 2]),array([0, 2, 2]),array([4, 0, 1])]

In [317]: print out
[[ 66.66666667   0.          33.33333333   0.           0.        ]
 [  0.          50.          50.           0.           0.        ]
 [ 33.33333333   0.          66.66666667   0.           0.        ]
 [ 33.33333333  33.33333333   0.           0.          33.33333333]]
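
For reuse, the steps above can be wrapped in a function. This is only a minimal sketch and not part of the original answer: the name `percentage_table` is hypothetical, and it assumes every array is non-empty and holds non-negative integers. It passes `minlength` to `np.bincount` so that the reshape also works when the largest value is missing from the last array.

```python
import numpy as np

def percentage_table(arrays):
    """Hypothetical helper: row i holds the percentage of each value in arrays[i]."""
    lens = np.array([len(a) for a in arrays])
    # Row index for every element of the flattened input
    row_ids = np.repeat(np.arange(len(arrays)), lens)
    vals = np.concatenate(arrays)
    n_rows, n_cols = len(arrays), int(vals.max()) + 1
    # Encode each (row, value) pair as one flat index and count the pairs
    counts = np.bincount(row_ids * n_cols + vals, minlength=n_rows * n_cols)
    return 100.0 * counts.reshape(n_rows, n_cols) / lens[:, None]

listvalues = [np.array([0, 0, 2]), np.array([1, 1, 2, 2]), np.array([0, 2, 2])]
table = percentage_table(listvalues)
print(np.round(table, 2))
```

Each row of `table` sums to 100, which is a quick sanity check on the result.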

Upvotes: 3

frist

Reputation: 1958

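I would solve this in a functional style. Note that the result below is indexed by value first and by array second (the transpose of the layout in the question) and holds fractions rather than percentages. For example: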

>>> import numpy as np
>>> import pprint
>>> 
>>> arr1 = np.array([0, 0, 2])
>>> arr2 = np.array([1, 1, 2, 2])
>>> arr3 = np.array([0, 2, 2])
>>> 
>>> arrays = (arr1, arr2, arr3)
>>> 
>>> u = np.unique(np.hstack(arrays))
>>> 
>>> result = [[1.0 * c.get(uk, 0) / l
...            for l, c in ((len(arr), dict(zip(*np.unique(arr, return_counts=True))))
...            for arr in arrays)] for uk in u]
>>> 
>>> pprint.pprint(result)
[[0.6666666666666666, 0.0, 0.3333333333333333],
 [0.0, 0.5, 0.0],
 [0.3333333333333333, 0.5, 0.6666666666666666]]
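
To get the layout asked for in the question (one row per array, one column per value, in percent), the two loops can be swapped and the fraction scaled by 100. A minimal sketch in the same style; the helper name `value_counts` is made up for this example:

```python
import numpy as np

arrays = (np.array([0, 0, 2]), np.array([1, 1, 2, 2]), np.array([0, 2, 2]))
u = np.unique(np.hstack(arrays)).tolist()

def value_counts(arr):
    # Map each distinct value in arr to its number of occurrences
    values, counts = np.unique(arr, return_counts=True)
    return dict(zip(values.tolist(), counts.tolist()))

# Outer loop over arrays (rows), inner loop over values (columns)
percentages = [[100.0 * c.get(uk, 0) / len(arr) for uk in u]
               for arr, c in ((arr, value_counts(arr)) for arr in arrays)]
```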

Upvotes: 0

David Hoksza

Reputation: 131

You can get a list of all values and then simply iterate over the individual arrays to get the percentages:

# sorted() fixes the column order, since set iteration order is not guaranteed
values = sorted(set([y for row in listvalues for y in row]))
print([[(a==x).sum()*100.0/len(a) for x in values] for a in listvalues])
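
The expression `(a==x).sum()` works because comparing an array with a scalar gives a boolean mask, and summing the mask counts the matches. A small sketch of the same idea for one array, comparing against all values at once and taking the mean of the mask (the variable names here are only for illustration):

```python
import numpy as np

a = np.array([1, 1, 2, 2])
values = np.array([0, 1, 2])

# mask[i, j] is True where a[i] equals values[j]; shape (4, 3)
mask = a[:, None] == values[None, :]
# The mean of a boolean column is the fraction of matches for that value
percent = 100.0 * mask.mean(axis=0)
```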

Upvotes: 2

HolyDanna

Reputation: 629

You can build a flat list of the percentages (truncated to two decimals) with the following code:

percentage_list = [((counter[i].get(j) if counter[i].get(j) else 0)*10000)//len(listvalues[i])/100.0 for i in range(len(listvalues)) for j in range(3)]

After that, create a NumPy array from that list:

results = np.array(percentage_list)

Reshape it into one row per array and one column per value:

results = results.reshape(3,3)

This should give you what you wanted.
It is most likely not the most efficient or the best way to do this, but it works.

Do not hesitate to ask if you have any questions.
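
As a variation, the 2D array can be built directly from the counters, without the hard-coded 3 and the reshape. This is only a sketch, assuming the values are non-negative integers; it relies on `Counter` returning 0 for keys it has not seen.

```python
import collections
import numpy as np

listvalues = [np.array([0, 0, 2]), np.array([1, 1, 2, 2]), np.array([0, 2, 2])]
counter = [collections.Counter(a.tolist()) for a in listvalues]

# Number of columns: largest value seen in any array, plus one
n_values = max(max(c) for c in counter) + 1

# c[j] is 0 when value j does not occur in that array
results = np.array([[100.0 * c[j] / len(a) for j in range(n_values)]
                    for c, a in zip(counter, listvalues)])
```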

Upvotes: 0

Eelco Hoogendoorn

Reputation: 10759

The numpy_indexed package has a utility function for this, called count_table, which can be used to solve your problem efficiently:

import numpy_indexed as npi
arrs = [arr1, arr2, arr3]
idx = [np.ones(len(a))*i for i, a in enumerate(arrs)]
(rows, cols), table = npi.count_table(np.concatenate(idx), np.concatenate(arrs))
table = table / table.sum(axis=1, keepdims=True)
print(table * 100)
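
If installing numpy_indexed is not an option, a similar table can be sketched in plain NumPy. This is a rough stand-in and not the package's implementation; the variable names are chosen for this example only. It also handles values that are negative or not contiguous, since columns are taken from the distinct values actually present.

```python
import numpy as np

arrs = [np.array([0, 0, 2]), np.array([1, 1, 2, 2]), np.array([0, 2, 2])]

# Row label for every element, and all elements flattened
row_ids = np.concatenate([np.full(len(a), i) for i, a in enumerate(arrs)])
vals = np.concatenate(arrs)

# cols: sorted distinct values; col_ids: column index of each element
cols, col_ids = np.unique(vals, return_inverse=True)

table = np.zeros((len(arrs), len(cols)))
# Unbuffered add, so repeated (row, column) pairs are all counted
np.add.at(table, (row_ids, col_ids), 1)

percent = 100 * table / table.sum(axis=1, keepdims=True)
print(percent)
```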

Upvotes: 2
