syre
syre

Reputation: 982

Find common elements from multiple categories within dataframe?

This is based on the answer to a previous question.

df
year code
2009    a
2009    a
2009    b
2010    b
2010    b
2011    b
2011    c
2011    c

I want to select codes common to all years within df. Here it is "b". One solution is:

Reduce(intersect, list(unique(df$code[df$year==2009]),
                       unique(df$code[df$year==2010]),
                       unique(df$code[df$year==2011])))

In practice, df contains about 15 years, thousands of codes, millions of rows, and multiple columns. For starters, the above command becomes quite long when all the years are included. Plus it's memory-consuming and slow. Is there sparser/faster code to do this?

Upvotes: 0

Views: 119

Answers (2)

alexis_laz
alexis_laz

Reputation: 13122

As another idea, you could work on a structure of occurences per year that can be handy and more efficient down the road instead of many pairwise intersections:

lvls = list(y = unique(df$year), c = levels(df$code))

library(Matrix)
tab = sparseMatrix(i = match(df$year, lvls$y), 
                   j = match(df$code, lvls$c), 
                   x = TRUE, 
                   dimnames = lvls)

tab
#3 x 3 sparse Matrix of class "lgCMatrix"
#      c
#y      a b c
#  2009 | | .
#  2010 . | .
#  2011 . | |

And, then, :

colSums(tab) == nrow(tab)
#    a     b     c 
#FALSE  TRUE FALSE

or, in this case, better:

colnames(tab)[diff(tab@p) == nrow(tab)]
#[1] "b"

"df" is:

df = structure(list(year = c(2009L, 2009L, 2009L, 2010L, 2010L, 2011L, 
2011L, 2011L), code = structure(c(1L, 1L, 2L, 2L, 2L, 2L, 3L, 
3L), .Label = c("a", "b", "c"), class = "factor")), .Names = c("year", 
"code"), class = "data.frame", row.names = c(NA, -8L))

Upvotes: 2

Aramis7d
Aramis7d

Reputation: 2496

Using tidyverse functions and considerng dft1 as your input, you can try:

dft1 %>%
  unique() %>%
  group_by(code) %>%
  filter( n_distinct(year) == length(unique(dft1$year)))

which gives:

   year  code
  <int> <chr>
1  2009     b
2  2010     b
3  2011     b

Upvotes: 0

Related Questions