Shawn Brar

Reputation: 1420

Remove subsequent rows of a group after first occurrence of 0 in a column

I have a data.table as follows:

data <- data.table(k = c("a", "a", "a", "a", "b", "b", "c", "c", "c", "d"),
                   year = c(2011, 2012, 2013, 2014, 2012, 2013, 2014, 2015, 2016, 2001),
                   grow_bool = c(1, 1, 0, 1, 0, 1, 1, 0, 1, 0))

#    k year grow_bool
# 1: a 2011         1
# 2: a 2012         1
# 3: a 2013         0
# 4: a 2014         1
# 5: b 2012         0
# 6: b 2013         1
# 7: c 2014         1
# 8: c 2015         0
# 9: c 2016         1
#10: d 2001         0

Now, I want to remove the rows which have a 0 in the column grow_bool, and all subsequent rows after it, for each k. For example: a has a 0 in grow_bool in 2013, hence this row and all the rows for a after it should be deleted. All the rows for b will be deleted, since the first row for b contains a 0. d has only one row, and it will be deleted since it has a 0.

The resulting data.table should be of the following form:

data_2 <- data.table(k = c("a", "a", "c"),
                     year = c(2011, 2012, 2014),
                     grow_bool = c(1, 1, 1))

#    k year grow_bool
# 1: a 2011         1
# 2: a 2012         1
# 3: c 2014         1

Thanks in advance.

Upvotes: 3

Views: 127

Answers (3)

chinsoon12

Reputation: 25225

Here is another option using cummin:

DT[DT[, .I[cummin(value) >= 1L], ID]$V1]

or

DT[DT[, cummin(value) >= 1L, ID]$V1]
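
Applied to the question's data (columns k and grow_bool instead of ID and value), this would read, for example:

# keep, per group k, the leading rows where the running minimum of grow_bool is still 1,
# i.e. everything before the group's first 0
data[data[, .I[cummin(grow_bool) >= 1L], k]$V1]
#    k year grow_bool
# 1: a 2011         1
# 2: a 2012         1
# 3: c 2014         1

Note that the shorter form without .I relies on the rows already being ordered so that each group is contiguous, which holds for the example data (and is ensured by setorder in the benchmark below).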

Timing depends on the number of groups in your dataset. Here are some timings when there is a large number of groups:

library(data.table)
set.seed(0L)
nr <- 1e7L   # number of rows
ng <- 5e5L   # number of groups
DT <- data.table(ID=sample(ng, nr, TRUE), value=sample(0:1, nr, TRUE))
setorder(DT, ID)

microbenchmark::microbenchmark(times=3L, 
    check=function(values) {
        all(sapply(values[-1L], function(x) fsetequal(values[[1L]], x)))
    },

    DT[!DT[, cumsum(value==0L) > 0L, ID]$V1],
    DT[-DT[, .I[cumsum(value==0L)>0L], ID]$V1],
    DT[DT[, .I[cumsum(value==0L)==0L], ID]$V1],
    DT[DT[, .I[cummin(value) >= 1L], ID]$V1],
    DT[DT[, cummin(value) >= 1L, ID]$V1])

timings:

Unit: milliseconds
                                           expr       min        lq      mean    median        uq       max neval
     DT[!DT[, cumsum(value == 0L) > 0L, ID]$V1]  766.2710  774.0762  796.6203  781.8814  811.7950  841.7086     3
 DT[-DT[, .I[cumsum(value == 0L) > 0L], ID]$V1] 1040.6440 1086.4132 1105.8104 1132.1824 1138.3936 1144.6047     3
 DT[DT[, .I[cumsum(value == 0L) == 0L], ID]$V1]  774.5525  828.6320  851.2925  882.7115  889.6626  896.6136     3
       DT[DT[, .I[cummin(value) >= 1L], ID]$V1]  723.6663  727.0806  731.0786  730.4950  734.7848  739.0745     3
           DT[DT[, cummin(value) >= 1L, ID]$V1]  657.1164  661.7835  664.8258  666.4506  668.6805  670.9103     3

check:

a0 <- DT[!DT[, cumsum(value==0L) > 0L, ID]$V1]
a1 <- DT[DT[, .I[cumsum(value==0L)==0L], ID]$V1]
a2 <- DT[-DT[, .I[cumsum(value==0L)>0L], ID]$V1]
a3 <- DT[DT[, .I[cummin(value) >= 1L], ID]$V1]
a4 <- DT[DT[, cummin(value) >= 1L, ID]$V1]
fsetequal(a0, a1)
#[1] TRUE
fsetequal(a0, a2)
#[1] TRUE
fsetequal(a0, a3)
#[1] TRUE
fsetequal(a0, a4)
#[1] TRUE

Upvotes: 3

Yuriy Saraykin

Reputation: 8880

library(data.table)
library(magrittr)
data <-
  data.table(
    k = c("a", "a", "a", "a", "b", "b", "c", "c", "c", "d"),
    year = c(2011, 2012, 2013, 2014, 2012, 2013, 2014, 2015, 2016, 2001),
    grow_bool = c(1, 1, 0, 1, 0, 1, 1, 0, 1, 0)
  )

data[, fltr := cumsum(grow_bool == 0), by = k] %>% 
  .[fltr == 0] %>% 
  .[, fltr := NULL] %>% 
  .[]
#>    k year grow_bool
#> 1: a 2011         1
#> 2: a 2012         1
#> 3: c 2014         1

Created on 2021-07-23 by the reprex package (v2.0.0)
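
Note that := modifies data by reference, so after this pipeline the original data still carries the helper fltr column. A non-mutating variant (just a sketch, wrapping the input in copy()) could be:

copy(data)[, fltr := cumsum(grow_bool == 0), by = k][fltr == 0][, fltr := NULL][]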

Upvotes: 0

Wimpel

Reputation: 27772

data[!data[, cumsum(grow_bool == 0) > 0, by = .(k)]$V1, ]
#    k year grow_bool
# 1: a 2011         1
# 2: a 2012         1
# 3: c 2014         1
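
For reference, evaluating just the inner grouped expression shows the flag that is being negated: it turns TRUE at each group's first 0 and stays TRUE, so negating it with ! keeps only the rows before that point.

data[, cumsum(grow_bool == 0) > 0, by = .(k)]
#     k    V1
#  1: a FALSE
#  2: a FALSE
#  3: a  TRUE
#  4: a  TRUE
#  5: b  TRUE
#  6: b  TRUE
#  7: c FALSE
#  8: c  TRUE
#  9: c  TRUE
# 10: d  TRUE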

Upvotes: 6
