Elhanan Schwarts

Reputation: 383

numpy matrix count distinct

Assume a NumPy array containing the following data:

import numpy as np


a = np.array([['2','W','A'],
             ['3', 'R', 'A'],
             ['4', 'W', 'R'],
             ['2', 'E', 'R'],
             ['4', 'E', 'Y'],
             ['3', 'E', 'Y']])
  1. Count how many times each unique value appears in the third column and return a NumPy array like the following:
[['A' '2']
 ['R' '2']
 ['Y' '2']]

(For example, the value A appears twice in the third column, so its row is 'A' '2'.)

  2. Similarly, count the unique combinations of values in the second and third columns and return a NumPy array with the following structure:
[['W' 'A' '1']
 ['R' 'A' '1']
 ['W' 'R' '1']
 ['E' 'R' '1']
 ['E' 'Y' '2']]

For example, the value E in the second column together with the value Y in the third column appears twice, so its row is 'E' 'Y' '2'.

Upvotes: 1

Views: 710

Answers (3)

anon01

Reputation: 11181

(setup code)

import pandas as pd
import numpy as np

a = np.array(
   [['2', 'W', 'A'],
   ['3', 'R', 'A'],
   ['4', 'W', 'R'],
   ['2', 'E', 'R'],
   ['4', 'E', 'Y'],
   ['3', 'E', 'Y']],    
)

Pandas is well suited for this and uses NumPy on the backend. For example, you can count the combinations of values in the second and third columns like this:

df = pd.DataFrame(a)    
cols = [1,2]
df[cols].value_counts().astype("str").reset_index().values

result:

array([['E', 'Y', '2'],
       ['W', 'R', '1'],
       ['W', 'A', '1'],
       ['R', 'A', '1'],
       ['E', 'R', '1']], dtype=object)
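
For the first question, the same idea should work on a single column. Here is a minimal sketch, assuming you want the rows sorted by label rather than by count; `count_column` is a hypothetical helper name used for illustration, not part of pandas:

```python
import numpy as np
import pandas as pd

a = np.array([['2', 'W', 'A'],
              ['3', 'R', 'A'],
              ['4', 'W', 'R'],
              ['2', 'E', 'R'],
              ['4', 'E', 'Y'],
              ['3', 'E', 'Y']])


def count_column(arr, col):
    # Hypothetical helper: count each distinct value in one column.
    # sort_index() orders the rows by label so ties come out deterministically.
    counts = pd.Series(arr[:, col]).value_counts().sort_index()
    labels = np.asarray(counts.index, dtype=str)
    # Pair each label with its count, converted to a string.
    return np.column_stack([labels, counts.to_numpy().astype(str)])


print(count_column(a, 2))
```

Sorting by label is a choice made here for reproducibility; drop `sort_index()` if you prefer the most frequent values first.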

Upvotes: 1

Gulzar

Reputation: 28074

Use NumPy's np.unique to count occurrences, then reformat the result:

import numpy as np


a = np.array([['2','W','A'],
             ['3', 'R', 'A'],
             ['4', 'W', 'R'],
             ['2', 'E', 'R'],
             ['4', 'E', 'Y'],
             ['3', 'E', 'Y']])

unique, counts = np.unique(a[:, 2], return_counts=True)
result = np.vstack([unique, counts]).T
print(result)

As for the second question:
If you want to avoid for loops and list comprehensions, stick to plain NumPy, and are willing to give up the exact output format, you can concatenate the two columns into a single key and count that:

ind_col = np.char.add(a[:, 1], a[:, 2])
unique, counts = np.unique(ind_col, return_counts=True)
result1 = np.vstack([unique, counts]).T

print(result1)
[['ER' '1']
 ['EY' '2']
 ['RA' '1']
 ['WA' '1']
 ['WR' '1']]
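
If you do need the two columns kept separate, one possible variation is to use the concatenated key only for counting and then read the original values back with `return_index`. This is a sketch under two assumptions: the separator `"|"` never occurs in your data, and the names `keys`, `first_idx` and `result2` are just illustrative:

```python
import numpy as np

a = np.array([['2', 'W', 'A'],
              ['3', 'R', 'A'],
              ['4', 'W', 'R'],
              ['2', 'E', 'R'],
              ['4', 'E', 'Y'],
              ['3', 'E', 'Y']])

# Build one key per row; the separator (assumed absent from the data)
# keeps pairs such as ('AB', 'C') and ('A', 'BC') from colliding.
keys = np.char.add(np.char.add(a[:, 1], "|"), a[:, 2])

# first_idx holds, for each unique key, the row where it first occurs.
_, first_idx, counts = np.unique(keys, return_index=True, return_counts=True)

# Take the original second and third columns from those rows
# and append the counts as strings.
result2 = np.column_stack([a[first_idx][:, [1, 2]], counts.astype(str)])
print(result2)
```

The rows come out sorted by key, the same order as the concatenated version above.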

Upvotes: 1

StupidWolf

Reputation: 46978

You can use np.unique() with axis=0 to tabulate the rows; the same function covers both questions:

def tabulate_array(x, columns):
    # Unique rows of the selected columns and how often each one occurs
    idx, counts = np.unique(x[:, columns], return_counts=True, axis=0)
    # Wrap the count in a list so a multi-digit count stays a single string
    return [list(idx[i]) + [str(counts[i])] for i in range(len(counts))]

The step that appends the count to each row could be refined further, but it gives you the string output as a list of lists, for example:

tabulate_array(a,[2])
[['A', '2'], ['R', '2'], ['Y', '2']]

tabulate_array(a,[1,2])
[['E', 'R', '1'],
 ['E', 'Y', '2'],
 ['R', 'A', '1'],
 ['W', 'A', '1'],
 ['W', 'R', '1']]
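
Note that np.unique returns the rows sorted, while the output in the question lists them in order of first appearance. If that order matters, a sketch along these lines might do it, and it also returns a NumPy array instead of a list; `tabulate_in_order` is a hypothetical name for illustration:

```python
import numpy as np

a = np.array([['2', 'W', 'A'],
              ['3', 'R', 'A'],
              ['4', 'W', 'R'],
              ['2', 'E', 'R'],
              ['4', 'E', 'Y'],
              ['3', 'E', 'Y']])


def tabulate_in_order(x, columns):
    # Unique rows, the index of each row's first occurrence, and its count.
    rows, first_idx, counts = np.unique(
        x[:, columns], axis=0, return_index=True, return_counts=True
    )
    # Reorder from sorted order back to order of first appearance.
    order = np.argsort(first_idx)
    # Append the counts (as strings) as the last column.
    return np.column_stack([rows[order], counts[order].astype(str)])


print(tabulate_in_order(a, [2]))
print(tabulate_in_order(a, [1, 2]))
```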

Upvotes: 1
