uhClem

Reputation: 147

R -- Finding absent elements in a repeated series

How do you find entries that should (predictably) be present in a data frame but aren't? My probably trivial problem is similar to this question, but with an extra layer or two: every solution I've tried fails against the repetitions and irregularities in the set.

# Over some years, with a few skipped...
yrs <- 1985:2018 
yrs <- yrs[-sample(1:length(yrs),5)]
# at a number of plots, replicated over two transects...
pts <- letters[1:10]
lns <- c('N','S')
# a study was done (data fields omitted):
yrsr <- rep(yrs, each= length(lns)*length(pts))
lnsr <- rep(rep(lns, each=length(pts)), times=length(yrs))
ptsr <- rep(rep(pts, times=length(lns)), times=length(yrs))
study <- data.frame(YEAR=yrsr, LINE=lnsr, PLOT=ptsr)
## But for random reasons certain plots got left out.
studym <- study[-sample(1:nrow(study), 23),]
# NB: The number of entries per plot varies:
studym$SPEC <- sample(c(1,1,1,1,1,2,2,2,3), nrow(studym), replace=TRUE)
studyAll <- studym[rep(row.names(studym), studym$SPEC),]

Missed plots might have been legitimate zeros or data entry errors or whatever; they need to be tracked down and either corrected or inserted as NAs. So to find them on the original data sheets I need a list of... all the elements that don't exist in studyAll... From my run here that would be something like

# 1985 N d
# 1985 N g
# ...
# 2017 S g

But since they don't exist I'm having trouble figuring out what to ask for, and where from. I haven't been able to figure out any joins to do what I want. I got a tantalizing summary with this:

library(dplyr)
studyAll %>% group_by(YEAR, LINE) %>% count(PLOT) %>% apply(2, table)

but that just tells me how much of each problem I've got and not where to find it.

(Bonus feeb question: Is there a way to construct study more directly from yrs, pts and lns, without those three lines of rep()? I figure there must be some way to generate a simple hierarchy like that, but couldn't find it.)

Upvotes: 0

Views: 45

Answers (1)

user666993


One way to find missing data in a factorial design is to generate all combinations of YEAR, LINE and PLOT from studyAll, and then use anti_join() to keep only the combinations that have no matching row in your studyAll data frame. Because expand() only uses values that actually occur in the data, years that were skipped entirely are not reported as missing.

library("tidyr")
library("dplyr")

studyMissing <- studyAll %>%
  expand(YEAR, LINE, PLOT) %>%
  anti_join(studyAll, by = c("YEAR", "LINE", "PLOT"))

# Giving
# A tibble: 23 x 3 
#    YEAR LINE  PLOT 
#   <int> <fct> <fct>
# 1  1985 N     f    
# 2  1986 N     h    
# 3  1986 S     g    
# 4  1992 N     h    
# 5  1996 S     g    
# 6  2001 N     e    
# 7  2001 N     i    
# 8  2002 N     c    
# 9  2002 S     g    
#10  2003 N     h    
# ... with 13 more rows
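
If you'd rather stay in base R, the same idea can be sketched with expand.grid(), which may also cover the bonus question, since it builds the full YEAR x LINE x PLOT hierarchy in one call. This is only a minimal sketch on made-up toy data: the objects full, observed and missing and the helper key() are hypothetical names for illustration, not part of the original data.

```r
# Toy values standing in for the real study (made up for illustration).
yrs <- c(1985, 1986)
lns <- c("N", "S")
pts <- c("a", "b", "c")

# expand.grid() varies its first argument fastest, so listing PLOT, LINE, YEAR
# reproduces the row order of the three nested rep() calls in the question.
full <- expand.grid(PLOT = pts, LINE = lns, YEAR = yrs,
                    stringsAsFactors = FALSE)[, c("YEAR", "LINE", "PLOT")]

# Drop two rows to mimic missed plots, then repeat one row to mimic
# plots that have several entries.
observed <- full[-c(2, 9), ]
observed <- observed[c(1, 1, seq_len(nrow(observed))), ]

# Hypothetical helper: collapse the three key columns into one string per row.
key <- function(d) paste(d$YEAR, d$LINE, d$PLOT, sep = "_")

# Keep the grid rows whose key never appears in the observed data.
missing <- full[!key(full) %in% key(observed), ]
print(missing)
```

With the real data you would presumably build the grid from unique(studyAll$YEAR), unique(studyAll$LINE) and unique(studyAll$PLOT) and compare it against studyAll instead of observed. Repeated rows do no harm here because %in% only asks whether a key occurs at all.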

Upvotes: 1
