Reputation: 173

Rolling Calculation in R with condition

I have a data table such as:

 CurrOdo        Lat            NextLat       PrevODO        NextOdo
 2.62           30.01115868   30.01115868           
 5.19           30.01116407   30.01116407       
 7.61           30.01116919   30.01116919       
18.82                         30.01119282     7.61        19.06
19.06           30.01119282   30.01119282       
19.35           30.01119339   30.01119339       
20.54                         30.01122998     19.35       81.5
20.81                         30.01122998     20.54       81.5
37.38                         30.01122998     20.81       81.5
81.5            30.01132238   30.01132238   

library(data.table)

atable <- data.table(
  odo     = c(2.62, 5.19, 7.61, 18.82, 19.06, 19.35, 20.54, 20.81, 37.38, 81.5),
  Lat     = c(30.01115868, 30.01116407, 30.01116919, NA, 30.01119282, 30.01119339, NA, NA, NA, 30.01132238),
  NextLat = c(30.01115868, 30.01116407, 30.01116919, 30.01119282, 30.01119282, 30.01119339,
              30.01122998, 30.01122998, 30.01122998, 30.01122998),
  PrevLat = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
  PrevODO = c(NA, NA, NA, 7.61, NA, NA, 19.35, 20.54, 20.81, NA),
  NextOdo = c(NA, NA, NA, 19.06, NA, NA, 81.5, 81.5, 81.5, NA))

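Each missing Lat value is a rolling calculation based on the formula below, where PrevLat is the Lat of the previous row (which may itself have just been calculated):

Lat = (NextLat - PrevLat) * ((CurrOdo - PrevODO) / (NextOdo - PrevODO)) + PrevLat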

Examples of how Lat would be calculated:

Row CurrODO 18.82: (30.01119282 - 30.01116919) * ((18.82 - 7.61) / (19.06 - 7.61)) + 30.01116919
Row CurrODO 20.54: (30.01122998 - 30.01119339) * ((20.54 - 19.35) / (81.5 - 19.35)) + 30.01119339
Row CurrODO 20.81: (30.01122998 - Lat calc result from 20.54 row) * ((20.81 - 20.54) / (81.5 - 20.54)) + Lat calc result from 20.54 row
Row CurrODO 37.38: (30.01122998 - Lat calc result from 20.81 row) * ((37.38 - 20.81) / (81.5 - 20.81)) + Lat calc result from 20.81 row
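
As a rough sketch, the four examples can be written out in base R with a small helper. The name interp_lat is made up for illustration and is not part of any package. The last two calls show how each row depends on the result of the row before it:

```r
# Hypothetical helper: the formula above for a single row
interp_lat <- function(next_lat, prev_lat, curr_odo, prev_odo, next_odo) {
  (next_lat - prev_lat) * ((curr_odo - prev_odo) / (next_odo - prev_odo)) + prev_lat
}

# Rows whose previous Lat is already known
lat_1882 <- interp_lat(30.01119282, 30.01116919, 18.82, 7.61, 19.06)
lat_2054 <- interp_lat(30.01122998, 30.01119339, 20.54, 19.35, 81.5)

# Rows that reuse the value calculated for the row before them
lat_2081 <- interp_lat(30.01122998, lat_2054, 20.81, 20.54, 81.5)
lat_3738 <- interp_lat(30.01122998, lat_2081, 37.38, 20.81, 81.5)

print(c(lat_1882, lat_2054, lat_2081, lat_3738), digits = 12)
```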

The final result would be:

CurrOdo     Lat             NextLat         PrevODO     NextOdo
2.62        30.01115868     30.01115868             
5.19        30.01116407     30.01116407             
7.61        30.01116919     30.01116919             
18.82       30.0111923247   30.01119282      7.61        19.06  
19.06       30.01119282     30.01119282             
19.35       30.01119339     30.01119339             
20.54       30.0111940906   30.01122998      19.35       81.5   
20.81       30.0111942496   30.01122998      20.54       81.5   
37.38       30.0112040049   30.01122998      20.81       81.5   
81.5        30.01132238     30.01132238

I am currently running this in SQL Server in a loop, but it takes a really long time. I could put it in a loop in R as well, but that will not perform well on large datasets. I have been stuck on this for several days, so any help is appreciated!

Upvotes: 2

Answers (3)

Wimpel

Reputation: 27792

My answer uses a repeat loop. You said "no loops", but I don't see any other way (there might be one, of course, this is R ;-) ).
The loop should still be pretty fast: on my system it takes about a second to fill in the NAs in 10M rows (see the benchmarks).

Output for Lat matches the desired output in the question.

Sidenote:
You might run into problems if your first Lat is NA.
Since PrevLat is always NA on the first row, an NA in the first row of Lat will never be recalculated, and the loop will never break.
You can (of course) build an escape route/break into the loop that prevents this. I left it out to keep the example short and readable.

repeat{
  #until there are no more NA in Lat
  if( sum( is.na( atable$Lat ) ) == 0 ){
    break
  }
  #(re)calculate PrevLat
  atable[, PrevLat := shift( Lat, 1, type = "lag" ) ]
  #calculate Lat when PrevLat is known, but Lat is not
  atable[ is.na( Lat ) & !is.na( PrevLat ),
          Lat := (NextLat-PrevLat)*((odo-PrevODO)/(NextOdo-PrevODO))+PrevLat ]
}

#       odo           Lat     NextLat       PrevLat PrevODO NextOdo
# 1:   2.62 30.0111586800 30.01115868            NA      NA      NA
# 2:   5.19 30.0111640700 30.01116407 30.0111586800      NA      NA
# 3:   7.61 30.0111691900 30.01116919 30.0111640700      NA      NA
# 4:  18.82 30.0111923247 30.01119282 30.0111691900    7.61   19.06
# 5:  19.06 30.0111928200 30.01119282 30.0111923247      NA      NA
# 6:  19.35 30.0111933900 30.01119339 30.0111928200      NA      NA
# 7:  20.54 30.0111940906 30.01122998 30.0111933900   19.35   81.50
# 8:  20.81 30.0111942496 30.01122998 30.0111940906   20.54   81.50
# 9:  37.38 30.0112040049 30.01122998 30.0111942496   20.81   81.50
# 10: 81.50 30.0113223800 30.01122998            NA      NA      NA
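
The escape route mentioned in the sidenote could look roughly like the sketch below. It is only one possible approach: it stops as soon as a pass fills no new values. The toy data (with an NA in the first row) and the message text are made up for illustration.

```r
library(data.table)

# Toy data (made up): the first Lat is NA and has no previous row to build on
dt <- data.table(
  odo     = c(1, 2, 3, 4),
  Lat     = c(NA, 10, NA, 20),
  NextLat = c(10, 10, 20, 20),
  PrevODO = c(NA, NA, 2, NA),
  NextOdo = c(2, NA, 4, NA)
)

repeat {
  n_before <- sum(is.na(dt$Lat))
  if (n_before == 0) break
  # (re)calculate PrevLat, then fill the rows whose previous Lat is known
  dt[, PrevLat := shift(Lat, 1, type = "lag")]
  dt[is.na(Lat) & !is.na(PrevLat),
     Lat := (NextLat - PrevLat) * ((odo - PrevODO) / (NextOdo - PrevODO)) + PrevLat]
  # escape route: if this pass filled nothing, further passes will not either
  if (sum(is.na(dt$Lat)) == n_before) {
    message("Stopping: ", n_before, " Lat value(s) could not be filled")
    break
  }
}

print(dt$Lat)
```

With this data the loop ends after two passes: row 3 is filled, and row 1 stays NA instead of keeping the loop running forever.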

Benchmarks

On a data.table of 10M rows (your atable repeated 1M times), on my system (a roughly 6-year-old i5 with 16 GB of memory), the loop takes about a second to calculate a value for every Lat.

dt <- atable[rep(atable[, .I], 1000000)]

system.time(
  repeat{
    #until there are no more NA in Lat
    if( sum( is.na( dt$Lat ) ) == 0 ){
      break
    }
    #(re)calculate PrevLat
    dt[, PrevLat := shift( Lat, 1, type = "lag" ) ]
    #calculate Lat when PrevLat is known
    dt[ is.na( Lat ) & !is.na( PrevLat ),
            Lat := (NextLat- PrevLat ) * ((odo - PrevODO) / (NextOdo - PrevODO)) + PrevLat ]
  }
)

# user  system elapsed 
# 0.90    0.35    1.08

sessionInfo()

R version 3.6.1 (2019-07-05)   
Platform: x86_64-w64-mingw32/x64 (64-bit)    
Running under: Windows 10 x64 (build 18362)      

other attached packages:    [1] data.table_1.12.4

Update: code explanation

What the code does:

1. It fills the column PrevLat with the Lat value from the previous row.
2. It identifies all rows where Lat is NA and PrevLat has a value (i.e. is not NA).
3. For all rows identified in step 2, it calculates the value for Lat using the formula you provided.

Steps 1 to 3 are repeated until sum(is.na(atable$Lat)) equals 0. Once this condition is met, there are no more NA values in the Lat column, so we can exit the repeat loop using break.
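
To illustrate why several passes are needed, here is a small base-R sketch of the same three steps on rows 6 to 9 of the question's data. It uses plain vectors instead of a data.table, and the variable names are made up for the example. Each pass can only fill the first NA of a run, so the number of passes equals the length of the longest run of consecutive NAs. Timings on other data will therefore depend on how long the gaps are.

```r
# Rows 6 to 9 of the question: one known Lat followed by a run of three NAs
lat      <- c(30.01119339, NA, NA, NA)
next_lat <- rep(30.01122998, 4)
odo      <- c(19.35, 20.54, 20.81, 37.38)
prev_odo <- c(NA, 19.35, 20.54, 20.81)
next_odo <- c(NA, 81.5, 81.5, 81.5)

passes <- 0
while (anyNA(lat)) {
  # step 1: previous row's Lat (base-R equivalent of shift(lat, 1, type = "lag"))
  prev_lat <- c(NA, head(lat, -1))
  # step 2: rows where Lat is missing but the previous Lat is known
  fill <- is.na(lat) & !is.na(prev_lat)
  # step 3: apply the formula to those rows only
  lat[fill] <- (next_lat[fill] - prev_lat[fill]) *
    ((odo[fill] - prev_odo[fill]) / (next_odo[fill] - prev_odo[fill])) + prev_lat[fill]
  passes <- passes + 1
  cat("pass", passes, "- NAs left:", sum(is.na(lat)), "\n")
}
```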

Upvotes: 5

Cole

Reputation: 11255

Here's a very explicit loop in data.table within {}:

library(data.table)
atable<-data.table(odo = c(2.62,5.19,7.61,18.82,19.06,19.35,20.54,20.81, 37.38,81.5 ), 
                   Lat = c(30.01115868,30.01116407,30.01116919,NA,30.01119282,30.01119339,NA,NA, NA, 30.01132238),
                   NextLat=c(30.01115868,30.01116407,30.01116919, 30.01119282, 30.01119282,30.01119339, 
                             30.01122998,30.01122998,30.01122998,30.01122998 ),
                   PrevLat=c(NA,NA,NA, NA, NA,NA, NA,NA,NA,NA ),
                   PrevODO=c(NA,NA,NA, 7.61, NA,NA, 19.35,20.54,20.81,NA ),
                   NextOdo=c(NA,NA,NA, 19.06, NA,NA, 81.5,81.5,81.5,NA )) 

options('digits' = 10)

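# label runs of consecutive rows, and keep the previous row's NextLat as the starting latitude
atable[, c('na_rleid', 'LagLat') := .(rleid(is.na(PrevODO)), shift(NextLat))]

# within each run of rows that need a value, carry the result forward row by row
atable[!is.na(PrevODO),
       Lat := {
         x = vector('numeric', .N)
         const = ((odo - PrevODO) / (NextOdo - PrevODO))

         # the first row of a run starts from the previous row's latitude
         x[1] = (NextLat[1] - LagLat[1]) * const[1] + LagLat[1]

         # later rows build on the value just calculated
         for (i in seq_len(.N)[-1]){
           x[i] = (NextLat[i] - x[i-1]) * const[i] + x[i-1]
         }
         x
       },
       by = na_rleid
       ]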
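
The inner loop is a running accumulation over one run of rows, so it could also be expressed with base R's Reduce() and accumulate = TRUE. The sketch below is an alternative formulation rather than the code above. It works on a single run (the rows with odo 20.54, 20.81 and 37.38) using plain vectors whose names are made up for the example:

```r
# One run of consecutive missing rows from the question's data
start_lat <- 30.01119339                 # last known Lat before the run
next_lat  <- rep(30.01122998, 3)
odo       <- c(20.54, 20.81, 37.38)
prev_odo  <- c(19.35, 20.54, 20.81)
next_odo  <- rep(81.5, 3)
const     <- (odo - prev_odo) / (next_odo - prev_odo)

# Each step takes the previous result and a row index, and returns that row's Lat
step <- function(prev, i) (next_lat[i] - prev) * const[i] + prev

# accumulate = TRUE keeps every intermediate value; [-1] drops the starting value
filled <- Reduce(step, seq_along(const), init = start_lat, accumulate = TRUE)[-1]

print(filled, digits = 12)
```

Reduce() still iterates internally, so this is mainly a more compact way to write the same recurrence, not a guaranteed speed-up.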

Upvotes: 1

user2474226

Reputation: 1502

I'll be happy to be corrected by an R guru, but I've not really seen easy ways to accumulate values forward without looping as you're doing.

But I guess if you install Rcpp and any associated paraphernalia, you could do something like this:

src <-
  "NumericVector fill_lat_na(NumericMatrix v){
    // columns by position: 0 = odo, 1 = Lat, 2 = NextLat, 4 = PrevODO, 5 = NextOdo
    NumericVector ret(v.nrow());
    for(int i=0; i < v.nrow(); ++i){
      ret[i] = v(i, 1);
      // a missing Lat is calculated from the previous row's result
      if(NumericVector::is_na(ret[i])) 
      {
        ret[i] = (v(i, 2) - ret[i-1]) * ((v(i, 0) - v(i, 4)) / (v(i, 5) - v(i, 4))) + ret[i-1] ;
      }
    }
    return(ret);
  }
  "
Rcpp::cppFunction(src)

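This will give you a function fill_lat_na(), which you can then invoke in the usual R fashion. Here dfmat stands for your data with the columns in the same order as atable in the question, since the function refers to columns by position: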

lat <- fill_lat_na(as.matrix(dfmat))

Note there is no lower bound check here, so e.g., if your first row has an NA in its Lat, this will fail. The function can perhaps be improved to reference named columns as well.
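
A possible variant with the lower-bound check added, taking named vectors instead of a matrix, is sketched below. The name fill_lat_checked and its argument names are made up for illustration, and leaving a first-row NA untouched is just one way to handle that case. It needs Rcpp and a working C++ toolchain.

```r
library(Rcpp)

cppFunction("
NumericVector fill_lat_checked(NumericVector odo, NumericVector lat,
                               NumericVector next_lat,
                               NumericVector prev_odo, NumericVector next_odo) {
  int n = lat.size();
  NumericVector ret = clone(lat);     // copy, so the input vector is not modified
  for (int i = 0; i < n; ++i) {
    if (NumericVector::is_na(ret[i])) {
      if (i == 0) continue;           // lower-bound check: no previous row, leave NA
      ret[i] = (next_lat[i] - ret[i-1]) *
               ((odo[i] - prev_odo[i]) / (next_odo[i] - prev_odo[i])) + ret[i-1];
    }
  }
  return ret;
}")

# Rows 6 to 10 of the question's table
odo      <- c(19.35, 20.54, 20.81, 37.38, 81.5)
lat      <- c(30.01119339, NA, NA, NA, 30.01132238)
next_lat <- c(rep(30.01122998, 4), 30.01132238)
prev_odo <- c(NA, 19.35, 20.54, 20.81, NA)
next_odo <- c(NA, 81.5, 81.5, 81.5, NA)

filled <- fill_lat_checked(odo, lat, next_lat, prev_odo, next_odo)
print(filled, digits = 12)
```

With a data.table such as atable, the call would be something like atable[, Lat := fill_lat_checked(odo, Lat, NextLat, PrevODO, NextOdo)].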

Upvotes: 2
