Reputation: 1258
I've got the following table (which is called train) (in reality much bigger)
UNSPSC adaptor alert bact blood collection packet patient ultrasoft whit
514415 1 0 1 0 0 0 0 1 0
514415 0 0 0 1 1 0 0 1 0
514415 0 0 1 0 0 0 0 1 0
514415 0 0 0 0 0 0 0 1 0
514415 1 0 1 0 0 0 0 1 0
514415 0 0 0 0 0 0 0 1 0
422018 1 0 1 0 0 0 0 1 0
422018 0 0 0 0 0 0 0 1 0
422018 0 0 0 1 0 0 0 1 0
411011 0 0 0 0 0 0 0 1 0
and the following table which I'll call associations:
lhd rhs
blood collection
adaptor bact
[...]
I want to calculate the number of unique UNSPSC per column where the value is equal to 1 per each record in the association table for both lhs and rhs. Like:
blood collection 1 adaptor bact 2
This code does it for just one term at the time.
apply(train[,-1], 2, function(x) length(unique(substr(train$UNSPSC,1,4)[x == 1])))
Upvotes: 1
Views: 44
Reputation: 887118
A similar option with tidyverse
would be (data from @PoGibas post) to apply pmap
on the 'associations' data to loop over the columns, filter
the 'train' where the columns are both 1, pull
the 'UNSCPSC' column and get the length
of unique
elements (n_distinct
)
library(tidyverse)
pmap_int(associations, ~ train %>%
filter(!! rlang::sym(.x) == 1, !! rlang::sym(.y) == 1) %>%
pull(UNSPSC) %>%
n_distinct)
#[1] 3 1 2
Upvotes: 2
Reputation: 28339
Instead of iterating over trains
you can iterate over associations
and use subset
(subsets where x
row column 1 and 2 are equal to 1
), unique
, length
functions.
Columns from row x
are called using get
function.
train$lhd <- 1
train$rhs <- 1
apply(associations, 1, function(x)
length(unique(subset(train, get(x[1]) == 1 & get(x[2]) == 1)$UNSPSC))
)
# [1] 3 1 2
Data (train
):
structure(list(UNSPSC = c(514415L, 514415L, 514415L, 514415L,
514415L, 514415L, 422018L, 422018L, 422018L, 411011L), adaptor = c(1L,
0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L), alert = c(0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L), bact = c(1L, 0L, 1L, 0L, 1L, 0L, 1L,
0L, 0L, 0L), blood = c(0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L
), collection = c(0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), packet = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), patient = c(0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L), ultrasoft = c(1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L), whit = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L), lhd = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1), rhs = c(1, 1,
1, 1, 1, 1, 1, 1, 1, 1)), .Names = c("UNSPSC", "adaptor", "alert",
"bact", "blood", "collection", "packet", "patient", "ultrasoft",
"whit", "lhd", "rhs"), row.names = c(NA, -10L), class = "data.frame")
Data (associations
):
structure(list(V1 = c("lhd", "blood", "adaptor"), V2 = c("rhs",
"collection", "bact")), .Names = c("V1", "V2"), row.names = c(NA,
-3L), class = "data.frame")
Upvotes: 3