Reputation: 982
This is based on the answer to a previous question.
df
year code
2009 a
2009 a
2009 b
2010 b
2010 b
2011 b
2011 c
2011 c
I want to select codes common to all years within df. Here it is "b". One solution is:
Reduce(intersect, list(unique(df$code[df$year==2009]),
unique(df$code[df$year==2010]),
unique(df$code[df$year==2011])))
In practice, df contains about 15 years, thousands of codes, millions of rows, and multiple columns. For starters, the above command becomes quite long when all the years are included. Plus it's memory-consuming and slow. Is there sparser/faster code to do this?
Upvotes: 0
Views: 119
Reputation: 13122
As another idea, you could work on a structure of occurences per year that can be handy and more efficient down the road instead of many pairwise intersections:
lvls = list(y = unique(df$year), c = levels(df$code))
library(Matrix)
tab = sparseMatrix(i = match(df$year, lvls$y),
j = match(df$code, lvls$c),
x = TRUE,
dimnames = lvls)
tab
#3 x 3 sparse Matrix of class "lgCMatrix"
# c
#y a b c
# 2009 | | .
# 2010 . | .
# 2011 . | |
And, then, :
colSums(tab) == nrow(tab)
# a b c
#FALSE TRUE FALSE
or, in this case, better:
colnames(tab)[diff(tab@p) == nrow(tab)]
#[1] "b"
"df" is:
df = structure(list(year = c(2009L, 2009L, 2009L, 2010L, 2010L, 2011L,
2011L, 2011L), code = structure(c(1L, 1L, 2L, 2L, 2L, 2L, 3L,
3L), .Label = c("a", "b", "c"), class = "factor")), .Names = c("year",
"code"), class = "data.frame", row.names = c(NA, -8L))
Upvotes: 2
Reputation: 2496
Using tidyverse
functions and considerng dft1
as your input, you can try:
dft1 %>%
unique() %>%
group_by(code) %>%
filter( n_distinct(year) == length(unique(dft1$year)))
which gives:
year code
<int> <chr>
1 2009 b
2 2010 b
3 2011 b
Upvotes: 0