Fran
Fran

Reputation: 11

Applying a function to the whole data table by groups

Let's suppose the following data table:

a = runif(40)
b = c(rep(NA,5), runif(5), rep(NA,3),runif(3),NA,runif(3), c(rep(NA,3), runif(7), rep(NA,4), runif(3), NA,NA, runif(1)))
c = rep(1:4,each=10)
DT = data.table(a,b,c)

I want to eliminate the rows with the first NA values in b for every unique value in c (first NAs when c==1, when c==2...), but not the rows with the NAs that come after.

I can do it by using a loop:

for(i in unique(DT$c))
{
  first_NA = which(DT$c==i)[1]
  last_NA = which(!is.na(DT[,b]) & DT$c==i)[1] - 1
  DT = DT[-c(first_NA:last_NA)]
}

But I wonder if there is any simpler way of doing this by using a function for the whole data table using groups (by in data table or groupby in dplyr), without just applying it to columns.

Thank you!

Upvotes: 1

Views: 90

Answers (2)

Peace Wang
Peace Wang

Reputation: 2419

You can filter out the first NA values in b through

DT[, .SD[cumsum( !is.na(b) ) != 0], by = .(c)]

Upvotes: 3

Billy34
Billy34

Reputation: 2174

You have to mark these lines then keep those not marked.

# mark values
DT <- DT[, by=c, 
  flag := is.na(b[1]) # first value of b is NA
    & (seq_len(.N)==1) # only for first value
]

# discard marked
DT <- DT[(!flag)]

# remove flag
DT[, flag:=NULL]

or in a row

DT[, by=c, flag:=is.na(b[1]) & (seq_len(.N)==1)][(!flag)][, flag:=NULL]

Upvotes: 1

Related Questions