Dario Federici
Dario Federici

Reputation: 1258

Count unique instances in a column given the names of two other columns

I've got the following table (which is called train) (in reality much bigger)

UNSPSC adaptor alert bact blood collection packet patient ultrasoft whit
 514415       1     0    1     0          0      0       0         1    0
 514415       0     0    0     1          1      0       0         1    0
 514415       0     0    1     0          0      0       0         1    0
 514415       0     0    0     0          0      0       0         1    0
 514415       1     0    1     0          0      0       0         1    0
 514415       0     0    0     0          0      0       0         1    0
 422018       1     0    1     0          0      0       0         1    0
 422018       0     0    0     0          0      0       0         1    0
 422018       0     0    0     1          0      0       0         1    0
 411011       0     0    0     0          0      0       0         1    0

and the following table which I'll call associations:

 lhd     rhs
blood   collection
adaptor bact
[...]

I want to calculate the number of unique UNSPSC per column where the value is equal to 1 per each record in the association table for both lhs and rhs. Like:

blood collection 1 adaptor bact 2

This code does it for just one term at the time.

apply(train[,-1], 2, function(x) length(unique(substr(train$UNSPSC,1,4)[x == 1])))

Upvotes: 1

Views: 44

Answers (2)

akrun
akrun

Reputation: 887118

A similar option with tidyverse would be (data from @PoGibas post) to apply pmap on the 'associations' data to loop over the columns, filter the 'train' where the columns are both 1, pull the 'UNSCPSC' column and get the length of unique elements (n_distinct)

library(tidyverse)
pmap_int(associations, ~ train %>% 
                           filter(!! rlang::sym(.x) == 1, !! rlang::sym(.y) == 1) %>% 
                           pull(UNSPSC) %>% 
                           n_distinct)
#[1] 3 1 2

Upvotes: 2

pogibas
pogibas

Reputation: 28339

Instead of iterating over trains you can iterate over associations and use subset (subsets where x row column 1 and 2 are equal to 1), unique, length functions.
Columns from row x are called using get function.

train$lhd <- 1
train$rhs <- 1
apply(associations, 1, function(x)
    length(unique(subset(train, get(x[1]) == 1 & get(x[2]) == 1)$UNSPSC))
)
# [1] 3 1 2

Data (train):

structure(list(UNSPSC = c(514415L, 514415L, 514415L, 514415L, 
514415L, 514415L, 422018L, 422018L, 422018L, 411011L), adaptor = c(1L, 
0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L), alert = c(0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L), bact = c(1L, 0L, 1L, 0L, 1L, 0L, 1L, 
0L, 0L, 0L), blood = c(0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L
), collection = c(0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), packet = c(0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), patient = c(0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L), ultrasoft = c(1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L), whit = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L), lhd = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1), rhs = c(1, 1, 
1, 1, 1, 1, 1, 1, 1, 1)), .Names = c("UNSPSC", "adaptor", "alert", 
"bact", "blood", "collection", "packet", "patient", "ultrasoft", 
"whit", "lhd", "rhs"), row.names = c(NA, -10L), class = "data.frame")

Data (associations):

structure(list(V1 = c("lhd", "blood", "adaptor"), V2 = c("rhs", 
"collection", "bact")), .Names = c("V1", "V2"), row.names = c(NA, 
-3L), class = "data.frame")

Upvotes: 3

Related Questions