data.table: filtering out data entry errors by selectively omitting rows

Question

In a project with several million rows of information, where we want to have one row per year per case under study, we discovered a data entry error so that some cases have errors of extra rows with some variables that differ. (That is to say, these are not fixable with unique or duplicated.) After checking many of these by hand, we understand the problem and how the result should be. But I need your help to make data.table do the right thing.

I made a small test case. Code to reproduce the data is below.

> DT
    year case status
 1: 1980    a   born
 2: 1980    a  alive
 3: 1981    a  alive
 4: 1982    a  alive
 5: 1999    b  alive
 6: 1999    b  alive
 7: 2000    b  alive
 8: 2004    c  alive
 9: 2005    c  alive
10: 1977    d  alive
11: 1977    d   dead
12: 1983    e  alive
13: 1984    e   born
14: 1984    e  alive
15: 1985    e  alive
16: 1986    e  alive
17: 2000    f  alive
18: 2001    f  alive
19: 2002    f  alive
20: 2002    f   dead
21: 2003    f  alive
    year case status

Problems to fix

Case "a" has 2 rows for 1980. Because that is first record for that case, we want to keep the one that says "born" and remove the one "alive". Case "b" has 2 rows for 1999, but we don't have "born" in either one. We want to keep just one "alive" row.
Case "e" has an erroneous "born" in 1984. Since it was "alive" in 1983, the 1984 "born" should be removed.
Case "f" has an erroneous "dead" in 2002. Because it shows as "alive" in 2003, we believe "dead" is an error. So delete the one that says "dead", but only because the next time after has an "alive".

Case "d" has 2 rows in 1977, but we want to keep both

I'm stuck on finding a good way to isolate the row groups. For each case, I want to mark the rows for the first year, then figure what to do. It seemed like naming the grow groups with .GRP would give some clarity, but I still have problem of isolating first row group for each case.

library(data.table)
DT <- data.table(year = c(1980, 1980, 1981, 1982, 1999, 1999, 2000,
                          2004, 2005, 1977, 1977, 1983, 1984, 1984,
                          1985, 1986, 2000, 2001, 2002, 2002, 2003),
                 case = c("a", "a",  "a", "a",   "b",  "b", "b", "c",
                           "c", "d", "d", "e",  "e", "e", "e", "e",
                           "f", "f", "f", "f", "f"),
                 status = c("born", "alive", "alive", "alive", "alive",
                            "alive", "alive", "alive", "alive", "alive",
                            "dead", "alive", "born", "alive", "alive", "alive",
                            "alive", "alive", "alive", "dead", "alive"))

## re-order the rows, just in case
DT <- DT[order(case, year, status)]
## A correct answer would be:
DT[-c(1, 5, 14, 20)]

## Here is my effort to fix problem 1.
setkey(DT, case, year)

## create a bunch of index variables, naive way to
## find first year with multiple rows
DT[ , idx:=1:.N, by = list(case, year)]
## number of rows with case,year 
DT[ , count := uniqueN(status), by = list(case, year)]
DT[ , caseyr := 1:.N, by = list(case)]
DT[ , casegrp := .GRP, by = list(case, year)]

I thought this would isolate the first block for each case, but fails. Note it picks up case "e"

 > DT[count > 1 & 1 %in% idx & 1 %in% caseyr, .SD, by = list(case) ]
    case year status idx count caseyr casegrp
 1:    a 1980   born   1     2      1       1
 2:    a 1980  alive   2     2      2       1
 3:    b 1999  alive   1     2      1       4
 4:    b 1999  alive   2     2      2       4
 5:    d 1977  alive   1     2      1       8
 6:    d 1977   dead   2     2      2       8
 7:    e 1984   born   1     2      2      10
 8:    e 1984  alive   2     2      3      10
 9:    f 2002  alive   1     2      3      15
10:    f 2002   dead   2     2      4      15

BenoitLondon · Accepted Answer

here's a solution:

library(data.table)
DT <- data.table(year = c(1980, 1980, 1981, 1982, 1999, 1999, 2000,
                          2004, 2005, 1977, 1977, 1983, 1984, 1984,
                          1985, 1986, 2000, 2001, 2002, 2002, 2003),
                 case = c("a", "a",  "a", "a",   "b",  "b", "b", "c",
                          "c", "d", "d", "e",  "e", "e", "e", "e",
                          "f", "f", "f", "f", "f"),
                 status = c("born", "alive", "alive", "alive", "alive",
                            "alive", "alive", "alive", "alive", "alive",
                            "dead", "alive", "born", "alive", "alive", "alive",
                            "alive", "alive", "alive", "dead", "alive"))
#to check the ids removed are 1,5,14,20
setkey(DT,case, year, status)
DT[, id := 1:.N]
DT <- DT[order(case, year, status, -id)]

#remove duplicated alive (or other)
DT <- DT[!duplicated(DT[, list(case, year, status)])]

#compute year ordering
DT[, status_rank := rank(year), by = list(case)]

#remove late born
DT[, is_first := status_rank == min(status_rank), by = case]
DT <- DT[status != "born" | is_first]

#remove early dead
DT[, is_last := status_rank == max(status_rank), by = case]
DT <- DT[status != "dead" | is_last]

#remove redundant alive unless it's with dead
DT[, keep_alive := paste(sort(unique(status)), collapse = "") %in% c("alive", "alivedead") , by = list(case, year)]
DT <- DT[status != "alive" | keep_alive]
DT[, c("status_rank", "is_first", "is_last", "keep_alive") := NULL]
DT
    year case status id
 1: 1980    a   born  2
 2: 1981    a  alive  3
 3: 1982    a  alive  4
 4: 1999    b  alive  6
 5: 2000    b  alive  7
 6: 2004    c  alive  8
 7: 2005    c  alive  9
 8: 1977    d  alive 10
 9: 1977    d   dead 11
10: 1983    e  alive 12
11: 1984    e  alive 13
12: 1985    e  alive 15
13: 1986    e  alive 16
14: 2000    f  alive 17
15: 2001    f  alive 18
16: 2002    f  alive 19
17: 2003    f  alive 21

data.table: filtering out data entry errors by selectively omitting rows

Answers (2)

Related Questions