Reputation: 1778
Let's say I have 3 character vectors. I want to do some evaluations on them like comparing whether an element in a vector is also found in other vectors. I wouldn't know which vector is the shortest, so I want to compute it programmatically.
For example:
a <- c('Name','Type')
b <- c('Name','Age','Meta')
c <- c('ID','Gender','Color')
l1 <- list(a,b,c)
#print(l1)
l2 <- sapply(l1,length)
#print(l2)
pos <- which(l2==min(l2))
shortest <- l1[pos]
#print(shortest)
a1 <- l1[!seq(1,3) %in% pos][1]
a2 <- l1[!seq(1,3) %in% pos][2]
#print(a1)
#print(a2)
shortest[[1]][sapply(shortest,function(x) !x %in% unlist(c(a1,a2)))[,1]]
I want to find the elements that are in the shortest vector but not in the other two. In this example, I want to get 'Type' as the result. I am also having trouble when two vectors tie for the minimum length (in this example the lengths were 2,3,3, but I want to handle 2,2,3). I would appreciate some help. I need to run this over 11000 lists like l1, and my vectors have at least 20 elements each.
Upvotes: 1
Views: 76
Reputation: 46856
Tidy your data into a vector of observations and a grouping variable, coordinated in a data.frame
df = data.frame(
word = unlist(l1),
group = rep(seq_along(l1), lengths(l1)),
stringsAsFactors = FALSE
)
(lengths() is a more efficient way to implement sapply(x, length).)
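As a quick check of that equivalence, here is a minimal sketch using the list from the question. The names len_fast and len_slow are only illustrative.

```r
# Example list from the question
l1 <- list(c('Name','Type'), c('Name','Age','Meta'), c('ID','Gender','Color'))

# lengths() returns the length of each list element as an integer vector
len_fast <- lengths(l1)

# The sapply() form from the question computes the same values
len_slow <- sapply(l1, length)

# Compare the two results
identical(len_fast, len_slow)
```

For an unnamed list like this one, both calls should give the same integer vector; lengths() just avoids calling an R function once per element.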
Manipulate the data to add the information you need: the length of each group and the count of each word
df = cbind(df,
word_count = as.vector(table(df$word)[df$word]),
group_length = tabulate(df$group)[df$group]
)
Model your desired result by ordering the rows first by word count then by group length
df[order(df$word_count, df$group_length),]
The answer is the first row:
> df[order(df$word_count, df$group_length),]
    word group word_count group_length
2   Type     1          1            2
4    Age     2          1            3
5   Meta     2          1            3
6     ID     3          1            3
7 Gender     3          1            3
8  Color     3          1            3
1   Name     1          2            2
3   Name     2          2            3
Handle ties using a different metric to model your data; how to implement that depends on the model you wish to use.
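One possible tie-handling model, sketched under assumptions: keep every word that occurs exactly once across all vectors and belongs to any group of minimum length. This differs from comparing only against the longer vectors, because a word shared between two tied vectors is dropped, and a word repeated inside a single vector would also be counted twice. The list below is a hypothetical 2,2,3 variant of the question's data, and the names ties and result are illustrative.

```r
# Hypothetical variant of the question's list with a tie for shortest (2, 2, 3)
l1 <- list(c('Name','Type'), c('Name','Reason'), c('ID','Gender','Color'))

# Tidy: one row per word, with the index of the vector it came from
df <- data.frame(
  word = unlist(l1),
  group = rep(seq_along(l1), lengths(l1)),
  stringsAsFactors = FALSE
)

# Manipulate: overall count of each word and length of each group
df$word_count <- as.vector(table(df$word)[df$word])
df$group_length <- tabulate(df$group)[df$group]

# Model: words that occur once overall and sit in a group of minimum length
ties <- df[df$word_count == 1 & df$group_length == min(df$group_length), ]

# One character vector of qualifying words per tied group
result <- split(ties$word, ties$group)
```

Here result has one entry per tied group, named by the group index, so both shortest vectors are reported rather than only the first row.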
This is essentially the same as @hpesoj626's answer, with the 'tidy' step
tidy <- l1 %>% enframe() %>% unnest()
The 'manipulate' step
manip <- tidy %>%
group_by(name) %>% mutate(list_n = n()) %>% ungroup() %>%
group_by(value) %>% mutate(not_in = n()) %>% ungroup()
and the 'model' step
manip %>% filter(list_n == min(list_n) & not_in == 1) %>%
select(-list_n, -not_in)
Upvotes: 2
Reputation: 2425
I made some modifications to the code in your original post, including adding a vector 'd' that also has two elements and therefore ties with your original vector 'a' for shortest. As I understand your need, when vectors tie for shortest, each should return its elements that are not found in any of the longer vectors (that is, in this example you don't want to compare 'a' and 'd' with each other, since they are tied for shortest; rather, you want to compare each of them to 'b' and 'c').
The solution below uses the setdiff() function to identify and return differences. It also groups all not-shortest vectors into a single vector of unique elements to compare all at once, rather than to iterate over each of the not-shortest vectors individually.
a <- c('Name','Type')
b <- c('Name','Age','Meta')
c <- c('ID','Gender','Color')
d <- c('Name','Reason')
l1 <- list(a,b,c,d)
l2 <- sapply(l1,length)
pos <- which(l2==min(l2))
shortest <- l1[pos]
# All the vectors that are not among the shortest ones
not_shortest <- l1[-pos]
# Collapse all the vectors we want to search through into a single vector of unique elements
all_not_shortest <- unique(unlist(not_shortest))
# For each of the shortest vectors (here 'a' and 'd' tie for shortest), return the
# elements that do not appear anywhere in the not-shortest vectors
lapply(shortest,setdiff,all_not_shortest)
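Since the question mentions running this over about 11000 lists, the same setdiff() logic could be wrapped in a function and applied with lapply(). This is a sketch only: unique_to_shortest and many_lists are hypothetical names, and the two small lists stand in for the real data.

```r
# Hypothetical helper: for one list of character vectors, return the elements
# of each shortest vector that are absent from all longer vectors
unique_to_shortest <- function(vecs) {
  len <- lengths(vecs)
  is_shortest <- len == min(len)
  # Pool the longer vectors into one set of unique elements
  others <- unique(unlist(vecs[!is_shortest]))
  lapply(vecs[is_shortest], setdiff, others)
}

# Two small stand-in lists; the real input would hold many more of these
many_lists <- list(
  list(c('Name','Type'), c('Name','Age','Meta'), c('ID','Gender','Color')),
  list(c('Name','Type'), c('Name','Age','Meta'), c('ID','Gender','Color'),
       c('Name','Reason'))
)

# One result per input list; each result has one entry per shortest vector
results <- lapply(many_lists, unique_to_shortest)
```

Using a logical index instead of l1[-pos] is a small design choice: if every vector has the same length, the pool of longer vectors is simply empty and each vector's unique elements are returned.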
Upvotes: 1
Reputation: 3619
One way is to form a data frame of the elements of the list, then filter by the smallest number of elements and the lowest word frequency. This also captures cases where more than one unique word is in the same vector.
library(tidyverse)
l1 %>% enframe() %>% unnest() %>%
group_by(name) %>%
mutate(list_n = n()) %>%
ungroup() %>%
group_by(value) %>%
mutate(not_in = n()) %>%
ungroup() %>%
filter(list_n == min(list_n) & not_in == 1) %>%
select(-list_n, -not_in)
# # A tibble: 1 x 2
#    name value
#   <int> <chr>
# 1     1 Type
Upvotes: 2