Reputation: 45
I am relatively new to R and I am trying to filter a data frame by year and specific column numbers. Here's a toy example of my data set:
Year UniqueID Bench.St SiteEUI
2011 1 Yes 450
2011 2 No 300
2011 3 No NA
2011 4 NA 350
2012 1 No 400
2012 2 Yes 200
2013 1 Yes 500
2013 2 No 100
2013 3 Yes 475
I am trying to extract the rows which have repeated information from 2011 to 2013 sorted through the UniqueIDs. Using the example above, 2011 has 4 UniqueIDs (1,2,3,4), 2012 has 2 UniqueIDs(1,2) and 2013 has 3 UniqueIDs(1,2,3). Since UniqueIDs 1 and 2 are present in the three years, I would like to extract the UniqueID rows for those years. Therefore the dataset above reduces to:
Year UniqueID Bench.St SiteEUI
2011 1 Yes 450
2011 2 No 300
2012 1 No 400
2012 2 Yes 200
2013 1 Yes 500
2013 2 No 100
I believe that dplyr or some other simple function may be able to do this but I am not sure how to go about it. Thanks!
Upvotes: 3
Views: 1108
Reputation: 38520
I think what you are asking is how to extract the set of observational units that are present in all years of your data. Here is one method using base R for a data.frame named dataSet:
# get a table of the frequency counts of each ID
idCount <- table(dataSet$uniqueIDs)
# keep the IDs (converted back to integers) that have the most counts
keepIDs <- as.integer(names(idCount))[idCount == max(idCount)]
# save the new data set that includes those IDs
newDataSet <- dataSet[dataSet$uniqueIDs %in% keepIDs,]
Upvotes: 3
Reputation: 5308
You could split
the dataset by years and blend the resulting annual list entries back together using merge
along with all = FALSE
(default). Like that, you end up with those values of 'uniqueID' that exist during all years under investigation.
## sample data
dat <- data.frame(Year = c(rep(2011, 4), rep(2012, 2), rep(2013, 3)),
UniqueID = c(1, 2, 3, 4, 1, 2, 1, 2, 3),
Bench.St = c("Yes", "No", "No", NA, "No", "Yes", "Yes", "No", "Yes"),
SiteEUI = c(450, 300, NA, 350, 400, 200, 500, 100, 475))
## split data by year and merge by 'uniqueID', discard non-matching entries
lst <- split(dat, dat$Year)
mrg <- Reduce(function(...) merge(..., by = "UniqueID"), lst)
## subset data based on remaining values of 'uniqueID'
dat[dat$UniqueID %in% mrg$UniqueID, ]
Year UniqueID Bench.St SiteEUI
1 2011 1 Yes 450
2 2011 2 No 300
5 2012 1 No 400
6 2012 2 Yes 200
7 2013 1 Yes 500
8 2013 2 No 100
Upvotes: 1