Reputation: 763
I have some (actually a lot of) binary encoded vectors like:
[0, 1, 0, 0, 1, 0] # but each one has many more elements
and they are all stored into a numpy (2D) array like:
[
[0, 1, 0, 0, 1, 0],
[0, 0, 1, 0, 0, 1],
[0, 1, 0, 0, 1, 0],
]
I want to get a frequency table of each label set. So, in this example, the frequency table will be:
[2,1]
Because the 1st label set has two appearances and the 2nd label set just one.
In other words, I want something like itemfreq from SciPy or histogram from NumPy, but for whole rows rather than single elements.
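As a pure-Python baseline for the behavior described above, the same row-frequency count can be sketched with collections.Counter over row tuples (this is a sketch of the desired result, not the original code):

```python
from collections import Counter

import numpy as np

labels = np.array([
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 1],
    [0, 1, 0, 0, 1, 0],
])

# map(tuple, ...) makes each row hashable so Counter can tally it
freq = Counter(map(tuple, labels))
print(freq[(0, 1, 0, 0, 1, 0)])  # 2
print(freq[(0, 0, 1, 0, 0, 1)])  # 1
```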
Now I have the following code implemented:
import numpy as np

def get_label_set_freq_table(labels):
    # sentinel rows of -1 (never a valid binary label), so unfilled
    # slots cannot accidentally match a real row as np.empty_like could
    uniques = np.full_like(labels, -1)
    freq_table = np.zeros(shape=labels.shape[0])
    equal = False
    for idx, row in enumerate(labels):
        # linear scan over the unique rows seen so far
        for lbl_idx, label_set in enumerate(uniques):
            if np.array_equal(row, label_set):
                equal = True
                freq_table[lbl_idx] += 1
                break
        if not equal:
            uniques[idx] = row
            freq_table[idx] += 1
        equal = False
    return freq_table
where labels is the array of binary encoded vectors.
It works, but it's extremely slow when the number of vectors is large (>58,000) and the number of elements in each one is also large (>8,000).
How can this be done in a more efficient way?
Upvotes: 2
Views: 487
Reputation: 221574
I am assuming you meant an array with only 1s and 0s. For those, we can reduce each row to a single scalar with binary scaling and then use np.unique
-
In [52]: a
Out[52]:
array([[0, 1, 0, 0, 1, 0],
[0, 0, 1, 0, 0, 1],
[0, 1, 0, 0, 1, 0]])
In [53]: s = 2**np.arange(a.shape[1])
In [54]: a1D = a.dot(s)
In [55]: _, start, count = np.unique(a1D, return_index=1, return_counts=1)
In [56]: a[start]
Out[56]:
array([[0, 1, 0, 0, 1, 0],
[0, 0, 1, 0, 0, 1]])
In [57]: count
Out[57]: array([2, 1])
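The steps above can be wrapped into one function. A caveat worth noting (my assumption, not stated in the transcript): 2**np.arange overflows int64 once there are more than 63 columns, so this packing trick only fits reasonably narrow arrays.

```python
import numpy as np

def row_freq_binary(a):
    """Count duplicate binary rows by packing each row into one integer.

    Assumes a has only 0s/1s and at most 63 columns (int64 overflow).
    """
    s = 2 ** np.arange(a.shape[1])  # per-column weights 1, 2, 4, ...
    a1d = a.dot(s)                  # each distinct row -> distinct scalar
    _, start, count = np.unique(a1d, return_index=True, return_counts=True)
    return a[start], count          # unique rows and their frequencies

a = np.array([[0, 1, 0, 0, 1, 0],
              [0, 0, 1, 0, 0, 1],
              [0, 1, 0, 0, 1, 0]])
rows, count = row_freq_binary(a)
print(count)  # [2 1]
```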
Here's a generalized one -
In [33]: unq_rows, freq = np.unique(a, axis=0, return_counts=1)
In [34]: unq_rows
Out[34]:
array([[0, 0, 1, 0, 0, 1],
[0, 1, 0, 0, 1, 0]])
In [35]: freq
Out[35]: array([1, 2])
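A minimal self-contained sketch of this generalized version (np.unique with axis=0 needs NumPy >= 1.13; note it returns the unique rows in lexicographically sorted order, so the counts come out as [1, 2] here):

```python
import numpy as np

def row_freq(a):
    """Frequency table over whole rows via np.unique along axis 0.

    Works for any row width, with no integer-packing overflow concerns.
    """
    unq_rows, freq = np.unique(a, axis=0, return_counts=True)
    return unq_rows, freq

a = np.array([[0, 1, 0, 0, 1, 0],
              [0, 0, 1, 0, 0, 1],
              [0, 1, 0, 0, 1, 0]])
unq_rows, freq = row_freq(a)
print(freq)  # [1 2]
```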
Upvotes: 2