Reputation: 57
I have a data frame with some data about different people. It looks like this:
Year Item ID
2005 a 1234
2005 b 1234
2005 a 4567
2005 b 4567
2006 a 4567
2006 a 7894
My data has 45,000 observations, about 1,000 different IDs, and 10 different years. I want to find the IDs of people that appear in more than one year. How do I do this? I thought of splitting the data by ID and checking whether each piece contains different years, but that doesn't seem like the smartest way to do it.
Upvotes: 2
Views: 37
Reputation: 13319
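We can drop repeated Year/ID pairs first, so that several items in the same year count only once, and then pick the IDs that are still duplicated:
# one row per Year/ID combination
Pairs<-unique(df[c("Year","ID")])
# IDs that still occur more than once appear in more than one year
unique(Pairs$ID[duplicated(Pairs$ID)])
#[1] 4567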
Upvotes: 1
Reputation: 32548
split the Year by ID and then keep only those sub-groups that have more than one unique Year:
list1 = lapply(split(df1$Year, df1$ID), unique)
list1 = list1[lengths(list1) > 1]
data.frame(ID = names(list1), count = lengths(list1))
# ID count
#4567 4567 2
#DATA
df1 = structure(list(Year = c(2005L, 2005L, 2005L, 2005L, 2006L, 2006L),
Item = c("a", "b", "a", "b", "a", "a"), ID = c(1234L, 1234L, 4567L, 4567L, 4567L, 7894L)),
class = "data.frame",
row.names = c(NA, -6L))
Upvotes: 1
Reputation: 388982
With dplyr, we can use n_distinct and keep only those IDs that have more than one year.
library(dplyr)
df %>%
group_by(ID) %>%
filter(n_distinct(Year) > 1) %>%
pull(ID) %>%
unique
#[1] 4567
A base R alternative with table. The rows of the table are sorted by ID, so the IDs are sorted as well before indexing:
sort(unique(df$ID))[rowSums(table(df$ID, df$Year) > 0) > 1]
#[1] 4567
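The sample data happens to be sorted by ID and has at most two items per year, so it may be worth checking any approach against less tidy input. Below is a minimal sketch of such a check using tapply; the data frame df2 and the names years_per_id and multi_year_ids are hypothetical and not part of the question.

```r
# Hypothetical test data (not from the question): IDs are unsorted, and
# ID 1234 has three rows that all fall in the same year.
df2 <- data.frame(
  Year = c(2005L, 2005L, 2005L, 2005L, 2006L, 2006L),
  Item = c("a", "a", "b", "c", "a", "a"),
  ID   = c(7894L, 1234L, 1234L, 1234L, 4567L, 7894L)
)

# Count the distinct years per ID; tapply() returns a vector named by ID
years_per_id <- tapply(df2$Year, df2$ID, function(y) length(unique(y)))

# Keep the IDs seen in more than one year (the names are character strings)
multi_year_ids <- as.integer(names(years_per_id)[years_per_id > 1])
multi_year_ids
#[1] 7894
```

Only 7894 spans two years here. 1234 has three rows but a single year, so an approach that returns it is counting rows rather than years.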
Upvotes: 1