JeanVuda

Reputation: 1778

Finding common elements in a list

Let's say I have three character vectors. I want to evaluate them, for example by checking whether an element of one vector also appears in the others. I don't know in advance which vector is the shortest, so I want to determine it programmatically.

For example:

a <- c('Name','Type')
b <- c('Name','Age','Meta')
c <- c('ID','Gender','Color')

l1 <- list(a,b,c)
#print(l1)
l2 <- sapply(l1,length)
#print(l2)

pos <- which(l2==min(l2))
shortest <- l1[pos]
#print(shortest)

a1 <- l1[!seq(1,3) %in% pos][1]
a2 <- l1[!seq(1,3) %in% pos][2]
#print(a1)
#print(a2)

shortest[[1]][sapply(shortest,function(x) !x %in% unlist(c(a1,a2)))[,1]]

I want to find the elements that are in the shortest vector but not in the other two. In this example, I want to get 'Type' as the result. I am also having trouble when two vectors tie for the minimum length (in this example the lengths are 2, 3, 3, but I also want to handle 2, 2, 3). I would appreciate some help. I need to run this over 11,000 lists like l1, and my vectors have at least 20 elements each.

Upvotes: 1

Views: 76

Answers (3)

Martin Morgan

Reputation: 46856

Tidy your data into a vector of observations and a grouping variable, held together in a data.frame:

df = data.frame(
    word = unlist(l1),
    group = rep(seq_along(l1), lengths(l1)),
    stringsAsFactors = FALSE
)

(lengths() is a more efficient way to implement sapply(x, length)).
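
As a quick check of that remark, the sketch below compares the two calls on the question's example list. It only shows that they give the same result here; it does not measure speed.

```r
# Example list from the question
l1 <- list(c('Name','Type'), c('Name','Age','Meta'), c('ID','Gender','Color'))

# Both calls return the number of elements in each list entry
len_a <- lengths(l1)
len_b <- sapply(l1, length)

# Same integer vector either way
same <- identical(len_a, len_b)
```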

Manipulate the data to add the information you need: the length of each group and the count of each word.

df = cbind(df,
    word_count = as.vector(table(df$word)[df$word]),
    group_length = tabulate(df$group)[df$group]
)

Model your desired result by ordering the rows first by word count, then by group length:

df[order(df$word_count, df$group_length),]

The answer is the first row:

> df[order(df$word_count, df$group_length),]
    word group word_count group_length
2   Type     1          1            2
4    Age     2          1            3
5   Meta     2          1            3
6     ID     3          1            3
7 Gender     3          1            3
8  Color     3          1            3
1   Name     1          2            2
3   Name     2          2            3

Handle ties using a different metric to model your data; how to implement that depends on the model you wish to use.
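
As one possible sketch of tie handling, assume the model is "keep every word that occurs exactly once overall and belongs to any group of minimal length". That rule is an assumption, not the only reasonable choice, and the fourth vector below is added purely to create a tie.

```r
# Example data from the question, plus a hypothetical fourth vector that ties for shortest
l1 <- list(c('Name','Type'), c('Name','Age','Meta'),
           c('ID','Gender','Color'), c('Name','Reason'))

# Tidy: one row per word, with the index of the vector it came from
df <- data.frame(
    word = unlist(l1),
    group = rep(seq_along(l1), lengths(l1)),
    stringsAsFactors = FALSE
)

# Manipulate: how often each word occurs, and how long each group is
df$word_count <- as.vector(table(df$word)[df$word])
df$group_length <- tabulate(df$group)[df$group]

# Model (assumed rule): words seen once, in any group of minimal length
keep <- df$word_count == 1 & df$group_length == min(df$group_length)
ties <- df[keep, c("word", "group")]
```

With this rule, every tied shortest vector contributes its own unique words instead of only the first row being reported. Note that a word repeated inside a single vector also gets a count above 1, so it would be dropped.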

This is essentially the same answer as @hpesoj626's, whose pipeline can be split into the 'tidy' step

library(tidyverse)
# Name the list-column explicitly; a bare unnest() warns in tidyr >= 1.0
tidy <- l1 %>% enframe() %>% unnest(value)

The 'manipulate' step

manip <- tidy %>%
  group_by(name) %>% mutate(list_n = n()) %>% ungroup() %>%
  group_by(value) %>% mutate(not_in = n()) %>% ungroup()

and the 'model' step

manip %>% filter(list_n == min(list_n) & not_in == 1) %>%
  select(-list_n, -not_in)

Upvotes: 2

Soren

Reputation: 2425

Please see some modifications to your original code below, including a new vector 'd' that also has two elements and therefore ties with your original vector 'a' as the shortest. As I understand your need, when several vectors tie for shortest, each of them should return its elements that do not appear in any of the longer vectors (that is, in this example you don't want to compare 'a' and 'd' with each other, since they tie for shortest; rather, you want to compare each of them to 'b' and 'c').

The solution below uses the setdiff() function to identify and return the differences. It also collapses all not-shortest vectors into a single vector of unique elements so that the comparison is done once, rather than iterating over each of the not-shortest vectors individually.

a <- c('Name','Type')
b <- c('Name','Age','Meta')
c <- c('ID','Gender','Color')
d <- c('Name','Reason')

l1 <- list(a,b,c,d)
l2 <- sapply(l1,length)

pos <- which(l2==min(l2))
shortest <- l1[pos]

#All the lists that are not the shortest ones
not_shortest <- l1[-pos]

#Collapse all the lists we want to search through into a single vector of unique elements
all_not_shortest <- unique(unlist(not_shortest))

#All of the shortest vectors (here 'a' and 'd' tie for shortest) compare their element differences to the entire set of all elements in not shortest vectors
lapply(shortest,setdiff,all_not_shortest)
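
Since the question mentions running this over many lists, the same steps could be wrapped in a function and applied with lapply(). This is a minimal sketch; the helper name unique_to_shortest and the two-element container many are made up for illustration.

```r
# Hypothetical helper: for one list of character vectors, return the elements of
# each shortest vector that appear in none of the longer vectors
unique_to_shortest <- function(l) {
    len <- lengths(l)
    is_shortest <- len == min(len)
    # Pool all longer vectors into one set of unique elements
    others <- unique(unlist(l[!is_shortest]))
    lapply(l[is_shortest], setdiff, others)
}

l1 <- list(c('Name','Type'), c('Name','Age','Meta'),
           c('ID','Gender','Color'), c('Name','Reason'))
res <- unique_to_shortest(l1)

# A list of many such lists (only two here, for illustration)
many <- list(l1, l1[1:3])
res_all <- lapply(many, unique_to_shortest)
```

Using a logical index instead of which() and negative positions keeps the selection of the longer vectors explicit. If every vector has the same length, nothing is left to compare against, and each vector's own unique elements come back unchanged.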

Upvotes: 1

hpesoj626

Reputation: 3619

One way is to form a data frame from the elements of the list, then keep the rows that belong to the shortest vector(s) and whose word occurs only once overall. This also captures cases where more than one unique word sits in the same vector.

library(tidyverse)
l1 %>% enframe() %>% unnest(value) %>%
  group_by(name) %>%
  mutate(list_n = n()) %>%
  ungroup() %>%
  group_by(value) %>%
  mutate(not_in = n()) %>%
  ungroup() %>%
  filter(list_n == min(list_n) & not_in == 1) %>%
  select(-list_n, -not_in)

# # A tibble: 1 x 2
#    name value
#   <int> <chr>
# 1     1 Type 

Upvotes: 2
