qnguyen
qnguyen

Reputation: 11

How to delete 3 consecutive rows that contain the same value in a data.table

I have a data table in R with 3 features as follow

DT_A <- data.table(sid=c(1,1,2,2,2,3,3,2,3,3), date=as.Date(c("2014-06-22","2014-06-23","2014-06-22","2014-06-23", "2014-06-24","2014-06-22",
                                                          "2014-06-23","2014-06-24","2014-06-25","2014-06-26")), 
               Status1 = c("A","B","A","A","B","A","A","A","B","B"))

The data looks like this

    sid       date Status1
 1:   1 2014-06-22       A
 2:   1 2014-06-23       B
 3:   2 2014-06-22       A
 4:   2 2014-06-23       A
 5:   2 2014-06-24       B
 6:   3 2014-06-22       A
 7:   3 2014-06-23       A
 8:   2 2014-06-24       A
 9:   3 2014-06-25       B
10:   3 2014-06-26       B

How can i check the Status 1 and see if there are 3 rows in a row that has value A (like row 6,7,8) then we will delete these?

Upvotes: 0

Views: 98

Answers (3)

Uwe
Uwe

Reputation: 42544

The question is tagged data.table, so I'll try to give an appropriate answer:

DT_A[!DT_A[, .I[.N == 3 & Status1 == "A"], by = rleid(Status1)]$V1]
     sid       date Status1
1:     1 2014-06-22       A
2:     1 2014-06-23       B
3:     2 2014-06-22       A
4:     2 2014-06-23       A
5:     2 2014-06-24       B
6:     3 2014-06-25       B
7:     3 2014-06-26       B

Other test cases

As pointed out by Frank, my first answer (now edited) was working just for the given sample data set provided by the OP but failed for other test cases.

So, the edited code is applied to some other test cases.

Case B: 3 consecutive rows of letters A and B

DT_B <- data.table(
  sid=c(1,1,2,2,2,3,3,2,3,3,3), 
  date=as.Date(c("2014-06-22","2014-06-23","2014-06-22","2014-06-23", "2014-06-24","2014-06-22",
                 "2014-06-23","2014-06-24","2014-06-25","2014-06-26","2014-06-26")), 
  Status1 = c("A","B","A","A","B","A","A","A","B","B","B"))
DT_B
    sid       date Status1
 1:   1 2014-06-22       A
 2:   1 2014-06-23       B
 3:   2 2014-06-22       A
 4:   2 2014-06-23       A
 5:   2 2014-06-24       B
 6:   3 2014-06-22       A
 7:   3 2014-06-23       A
 8:   2 2014-06-24       A
 9:   3 2014-06-25       B
10:   3 2014-06-26       B
11:   3 2014-06-26       B
DT_B[!DT_B[, .I[.N == 3 & Status1 == "A"], by = rleid(Status1)]$V1]
   sid       date Status1
1:   1 2014-06-22       A
2:   1 2014-06-23       B
3:   2 2014-06-22       A
4:   2 2014-06-23       A
5:   2 2014-06-24       B
6:   3 2014-06-25       B
7:   3 2014-06-26       B
8:   3 2014-06-26       B

Only the 3 consecutive rows containing letter A (rows 6 to 8) are removed.

Case C: Nothing to remove

DT_C <- data.table(
  sid=c(1,1,2,2,2,3,3,2,3,3,3), 
  date=as.Date(c("2014-06-22","2014-06-23","2014-06-22","2014-06-23", "2014-06-24","2014-06-22",
                 "2014-06-23","2014-06-24","2014-06-25","2014-06-26","2014-06-26")), 
  Status1 = c("A","B","A","A","B","A","A","C","B","B","C"))
DT_C
    sid       date Status1
 1:   1 2014-06-22       A
 2:   1 2014-06-23       B
 3:   2 2014-06-22       A
 4:   2 2014-06-23       A
 5:   2 2014-06-24       B
 6:   3 2014-06-22       A
 7:   3 2014-06-23       A
 8:   2 2014-06-24       C
 9:   3 2014-06-25       B
10:   3 2014-06-26       B
11:   3 2014-06-26       C
DT_C[!DT_C[, .I[.N == 3 & Status1 == "A"], by = rleid(Status1)]$V1]
    sid       date Status1
 1:   1 2014-06-22       A
 2:   1 2014-06-23       B
 3:   2 2014-06-22       A
 4:   2 2014-06-23       A
 5:   2 2014-06-24       B
 6:   3 2014-06-22       A
 7:   3 2014-06-23       A
 8:   2 2014-06-24       C
 9:   3 2014-06-25       B
10:   3 2014-06-26       B
11:   3 2014-06-26       C

No row is removed as there are no 3 consecutive rows containing A.

Case D: Edge case: remove all rows

DT_D <- DT_A[6:8]
DT_D
   sid       date Status1
1:   3 2014-06-22       A
2:   3 2014-06-23       A
3:   2 2014-06-24       A
DT_D[!DT_D[, .I[.N == 3 & Status1 == "A"], by = rleid(Status1)]$V1]
Empty data.table (0 rows) of 3 cols: sid,date,Status1

All rows are removed and an empty data.table is returned because the input data.table consists only of 3 rows with letter A.

Upvotes: 2

denis
denis

Reputation: 5673

I am supposing you are making a mistake in your sid definition, and that your 3 lines have all sid = 3. If not, sorry my answer will not work. If it is the case the solution can be one line:

 DT_A[,.SD[.N < 3 | Status1 != "A",], by = .(sid,Status1)]

Is a simple line that does what you want : it select the data where the number of line is less than 3 or different than B in column Status1 (that is the negation of your selection you want to make to delete : at least 3 A) when grouping by sid and Status1. Hope it helps

Upvotes: 1

d.b
d.b

Reputation: 32548

with(rle(DT_A$Status1 == "A"), {
    unlist(lapply(which(lengths >= 3), function(i)
        (1+cumsum(lengths)[i-1]):cumsum(lengths)[i]))
})
#[1] 6 7 8

Upvotes: 1

Related Questions