ira
ira

Reputation: 2644

How to efficiently find last preceding row with nonzero value in R data.table

Introduction

I am trying to find the optimal way how to find the last preceding row with nonzero value in a given column and return a value of different column on that row. I want to do it in R data.table and i am looking for maximum efficiency of that operation.

Example

let's have a data table like so:

set.seed(123)
DT = data.table(x=rep(c("b","a","c"),each=6),
                y=rep(1:6, 3),
                z = rbinom(18, 1, 0.3))

That gives us the following data table:

    x y z
 1: b 1 0
 2: b 2 1
 3: b 3 0
 4: b 4 1
 5: b 5 1
 6: b 6 0
 7: a 1 0
 8: a 2 1
 9: a 3 0
10: a 4 0
11: a 5 1
12: a 6 0
13: c 1 0
14: c 2 0
15: c 3 0
16: c 4 1
17: c 5 0
18: c 6 0

Now, the table is for each value in column x ordered by the column y. For each group given by the values in column x, I would like to create a column which would give me for each row the value of y from the row with last nonzero value of z.

Right now I am using lapply for each y and grouping by x which gives the desired result:

DT[, list(y,
          z, 
          output = lapply(y, function(x) max(y[z != 0 & y <= x]))
          ), 
   by = 'x']

The question

Can i make my code from the example more efficient?

Upvotes: 7

Views: 474

Answers (3)

chinsoon12
chinsoon12

Reputation: 25225

Another option using rolling join:

DT[, output:= DT[z==1][.SD, on=.(x, y), roll=Inf, x.y]]

output:

    x y z output
 1: b 1 0     NA
 2: b 2 1      2
 3: b 3 0      2
 4: b 4 1      4
 5: b 5 1      5
 6: b 6 0      5
 7: a 1 0     NA
 8: a 2 1      2
 9: a 3 0      2
10: a 4 0      2
11: a 5 1      5
12: a 6 0      5
13: c 1 0     NA
14: c 2 0     NA
15: c 3 0     NA
16: c 4 1      4
17: c 5 0      4
18: c 6 0      4

Upvotes: 3

akrun
akrun

Reputation: 887501

An option with non-equi join

library(data.table)
library(zoo)
DT[DT[z!=0, .(y1 = y, x)], output := y1, on = .(x, y <= y1), 
         mult = 'last'][, output := na.locf0(output), x]
DT
#    x y z output
# 1: b 1 0     NA
# 2: b 2 1      2
# 3: b 3 0      2
# 4: b 4 1      4
# 5: b 5 1      5
# 6: b 6 0      5
# 7: a 1 0     NA
# 8: a 2 1      2
# 9: a 3 0      2
#10: a 4 0      2
#11: a 5 1      5
#12: a 6 0      5
#13: c 1 0     NA
#14: c 2 0     NA
#15: c 3 0     NA
#16: c 4 1      4
#17: c 5 0      4
#18: c 6 0      4

Upvotes: 3

MichaelChirico
MichaelChirico

Reputation: 34733

You might try using nafill:

# create a dummy column that is only populated for nonzero z (and hence NA elsewhere)
DT[z != 0, y_copy := y]
# nafill on this column using LOCF strategy by group:
DT[ , ans := nafill(y_copy, type = 'locf'), by = x][]
#     x y z y_copy ans
#  1: b 1 0     NA  NA
#  2: b 2 1      2   2
#  3: b 3 0     NA   2
#  4: b 4 1      4   4
#  5: b 5 1      5   5
#  6: b 6 0     NA   5
#  7: a 1 0     NA  NA
#  8: a 2 1      2   2
#  9: a 3 0     NA   2
# 10: a 4 0     NA   2
# 11: a 5 1      5   5
# 12: a 6 0     NA   5
# 13: c 1 0     NA  NA
# 14: c 2 0     NA  NA
# 15: c 3 0     NA  NA
# 16: c 4 1      4   4
# 17: c 5 0     NA   4
# 18: c 6 0     NA   4

For now, nafill is a development only feature (data.table 1.12.3+) but 1.12.4 should be on CRAN in the next week or two. For the moment, you can install this with install.packages('data.table', type = 'source', repos = 'http://Rdatatable.github.io/data.table')

If you don't want to create y_copy, you could do this inline with is.na<-:

DT[ , ans := nafill(`is.na<-`(y, z == 0), type = 'locf'), by = x]

This will be inefficient because z==0 is calculated repeatedly by group (instead of as a single vector); you could do this in the first step then:

DT[ , z_zero := z == 0]

But this means another dummy column (with less storage than y_copy if y is numeric, character, or complex)

Upvotes: 5

Related Questions