Dario Federici
Dario Federici

Reputation: 1258

Count number of unique instances in a column depending on values in other columns

I've got the following table (which is called train) (in reality much bigger)

 UNSPSC adaptor alert bact blood collection packet patient ultrasoft whit
 514415       0     0    0     0          0      0       0         1    0
 514415       0     0    0     1          0      0       0         1    0
 514415       0     0    1     0          0      0       0         1    0
 514415       0     0    0     0          0      0       0         1    0
 514415       0     0    0     0          0      0       0         1    0
 514415       0     0    0     0          0      0       0         1    0
 422018       0     0    0     0          0      0       0         1    0
 422018       0     0    0     0          0      0       0         1    0
 422018       0     0    0     1          0      0       0         1    0
 411011       0     0    0     0          0      0       0         1    0

I want to calculate the number of unique UNSPSC per column where the value is equal to 1. So for column blood it will be 2 and for column ultrasoft will be 3.

I'm doing this but don't know how to continue:

apply(train[,-1], 2, ......)

I'm trying to not to use loops.

Upvotes: 1

Views: 97

Answers (2)

Fer Arce
Fer Arce

Reputation: 149

If you have columns of only 0 and 1, like in the example, just use colSums:

colSums(train[,-1]) # you remove the non numeric columns before use, like UNSPSC

# adaptor      alert       bact      blood collection     packet    patient 
#         0          0          1          2          0          0          0 
# ultrasoft       whit
#        10          0

Upvotes: -1

Ronak Shah
Ronak Shah

Reputation: 388982

To continue from where you left, we can use apply with margin=2 and calculate the length of unique values of "UNSPSC" for each column.

apply(train[-1], 2, function(x) length(unique(train$UNSPSC[x==1])))

#adaptor      alert       bact      blood collection     packet 
#     0          0          1          2          0          0 
#patient  ultrasoft       whit 
#     0          3          0 

Better option is with sapply/lapply which gives the same result but unlike apply does not convert the dataframe into matrix.

sapply(train[-1], function(x) length(unique(train$UNSPSC[x==1])))

Upvotes: 4

Related Questions