user697473
user697473

Reputation: 2293

tidyverse: removing rows from data frame on basis of values in other rows

I have a tibble of survey data. Each row represents an interview with a "respondent." Some respondents were interviewed once; others were interviewed multiple times. I want to select, from this tibble, only the last interview for each respondent.

Here is a minimal example:

tmp <- tribble(
  ~YYYYMM, ~ID, ~DATEPR, ~IDPREV,
   198001,   1,      NA,      NA,
   198001,   2,      NA,      NA,
   198001,   3,      NA,      NA,
   198002,   1,      198001,   1,
   198002,   2,      NA,      NA,
   198002,   3,      NA,      NA,
   198003,   1,      198002,   1,
   198003,   2,      NA,      NA,
   198003,   3,      198002,   3)

where

There are nine rows in the tibble above. But one respondent was interviewed three times, and another was interviewed twice. I want only the last interview for each respondent, so I want a tibble with only six rows. This code does the job:

for (i in 1:nrow(tmp)) {
  if (!is.na(tmp$DATEPR[i])) {
    ind <- which(tmp$YYYYMM == tmp$DATEPR[i] & tmp$ID == tmp$IDPREV[i])
    tmp <- tmp[-ind, ]
  }
}

But it seems a little hard to parse. Is there a clearer way to achieve the same end with the tidyverse functions? I have in mind a two–step function: first, get the indices of all rows to remove; second, remove the rows. But I haven't been able to implement this solution with map or with the dplyr functions.

Upvotes: 1

Views: 2804

Answers (1)

akuiper
akuiper

Reputation: 214957

If all previously interviewed IDs are listed on the 3rd and 4th columns, you can do a left anti_join of the data frame with itself using dplyr::anti_join, here you match DATEPR and IDPREV with the YYYYMM and ID respectively, only rows from temp whose YYYYMM and ID that don't have a match will be left:

anti_join(tmp, tmp, by = c("YYYYMM" = "DATEPR", "ID" = "IDPREV")) %>% 
    arrange(YYYYMM, ID)

# A tibble: 6 x 4
#  YYYYMM    ID DATEPR IDPREV
#   <dbl> <dbl>  <dbl>  <dbl>
#1 198001     2     NA     NA
#2 198001     3     NA     NA
#3 198002     2     NA     NA
#4 198003     1 198002      1
#5 198003     2     NA     NA
#6 198003     3 198002      3

After running your code:

for (i in 1:nrow(tmp)) {
    if (!is.na(tmp$DATEPR[i])) {
        ind <- which(tmp$YYYYMM == tmp$DATEPR[i] & tmp$ID == tmp$IDPREV[i])
        tmp <- tmp[-ind, ]
    }
}
tmp %>% arrange(YYYYMM, ID)
# A tibble: 6 x 4
#  YYYYMM    ID DATEPR IDPREV
#   <dbl> <dbl>  <dbl>  <dbl>
#1 198001     2     NA     NA
#2 198001     3     NA     NA
#3 198002     2     NA     NA
#4 198003     1 198002      1
#5 198003     2     NA     NA
#6 198003     3 198002      3

Upvotes: 2

Related Questions