Reputation: 173
I have a data table such as:
CurrOdo  Lat          NextLat      PrevODO  NextOdo
2.62     30.01115868  30.01115868  NA       NA
5.19     30.01116407  30.01116407  NA       NA
7.61     30.01116919  30.01116919  NA       NA
18.82    NA           30.01119282  7.61     19.06
19.06    30.01119282  30.01119282  NA       NA
19.35    30.01119339  30.01119339  NA       NA
20.54    NA           30.01122998  19.35    81.5
20.81    NA           30.01122998  20.54    81.5
37.38    NA           30.01122998  20.81    81.5
81.5     30.01132238  30.01132238  NA       NA
library(data.table)

atable <- data.table(
  odo     = c(2.62, 5.19, 7.61, 18.82, 19.06, 19.35, 20.54, 20.81, 37.38, 81.5),
  Lat     = c(30.01115868, 30.01116407, 30.01116919, NA, 30.01119282, 30.01119339, NA, NA, NA, 30.01132238),
  NextLat = c(30.01115868, 30.01116407, 30.01116919, 30.01119282, 30.01119282, 30.01119339,
              30.01122998, 30.01122998, 30.01122998, 30.01122998),
  PrevLat = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
  PrevODO = c(NA, NA, NA, 7.61, NA, NA, 19.35, 20.54, 20.81, NA),
  NextOdo = c(NA, NA, NA, 19.06, NA, NA, 81.5, 81.5, 81.5, NA)
)
The Lat value is a rolling calculation based on this formula, where PrevLat is the Lat of the previous row (which may itself be a calculated value):
Lat = (NextLat - PrevLat) * ((CurrODO - PrevODO) / (NextODO - PrevODO)) + PrevLat
Examples of how Lat would be calculated:
Row CurrODO 18.82: (30.01119282 - 30.01116919) * ((18.82 - 7.61) / (19.06 - 7.61)) + 30.01116919
Row CurrODO 20.54: (30.01122998 - 30.01119339) * ((20.54 - 19.35) / (81.5 - 19.35)) + 30.01119339
Row CurrODO 20.81: (30.01122998 - Lat calc result from 20.54 row) * ((20.81 - 20.54) / (81.5 - 20.54)) + Lat calc result from 20.54 row
Row CurrODO 37.38: (30.01122998 - Lat calc result from 20.81 row) * ((37.38 - 20.81) / (81.5 - 20.81)) + Lat calc result from 20.81 row
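For reference, the worked rows can be reproduced with a few lines of base R. This is only a sketch: interp_lat and the lat_* variables are illustrative names, not part of my actual code. It shows why the rows from 20.54 to 37.38 form a chain, since each result becomes the PrevLat of the next row.

```r
# Hypothetical helper wrapping the formula from the question
interp_lat <- function(next_lat, prev_lat, curr_odo, prev_odo, next_odo) {
  (next_lat - prev_lat) * ((curr_odo - prev_odo) / (next_odo - prev_odo)) + prev_lat
}

# Row 18.82 depends only on known values
lat_1882 <- interp_lat(30.01119282, 30.01116919, 18.82, 7.61, 19.06)

# Rows 20.54, 20.81 and 37.38: each result feeds the next row as PrevLat
lat_2054 <- interp_lat(30.01122998, 30.01119339, 20.54, 19.35, 81.5)
lat_2081 <- interp_lat(30.01122998, lat_2054,    20.81, 20.54, 81.5)
lat_3738 <- interp_lat(30.01122998, lat_2081,    37.38, 20.81, 81.5)

print(c(lat_1882, lat_2054, lat_2081, lat_3738), digits = 12)
```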
The final result would be:
CurrOdo  Lat            NextLat      PrevODO  NextOdo
2.62     30.01115868    30.01115868  NA       NA
5.19     30.01116407    30.01116407  NA       NA
7.61     30.01116919    30.01116919  NA       NA
18.82    30.0111923247  30.01119282  7.61     19.06
19.06    30.01119282    30.01119282  NA       NA
19.35    30.01119339    30.01119339  NA       NA
20.54    30.0111940906  30.01122998  19.35    81.5
20.81    30.0111942496  30.01122998  20.54    81.5
37.38    30.0112040049  30.01122998  20.81    81.5
81.5     30.01132238    30.01132238  NA       NA
I am currently running this in SQL Server in a loop, but it takes a really long time. I could put it in a loop in R as well, but that will not perform well with large datasets. I have been stuck on this for several days, so any help is appreciated!
Upvotes: 2
Views: 394
Reputation: 27792
My answer uses a repeat loop. You said "no loops", but I am not seeing any other way (there might be one, of course; this is R ;-) ).
The loop should be pretty fast, though: on my system it takes about a second to fill in the NAs in 10M rows (see benchmarks).
The output for Lat matches the desired output in the question.
Side note: you might run into problems if your first Lat has value NA. Since PrevLat will always be NA on the first row, a first-row NA for Lat will never be recalculated, and the loop will never break.
You can (of course) build an escape route/break into the loop that prevents this. I kept this out to keep the example readable and short.
repeat {
  # until there are no more NA in Lat
  if (sum(is.na(atable$Lat)) == 0) {
    break
  }
  # (re)calculate PrevLat
  atable[, PrevLat := shift(Lat, 1, type = "lag")]
  # calculate Lat when PrevLat is known, but Lat is not
  atable[is.na(Lat) & !is.na(PrevLat),
         Lat := (NextLat - PrevLat) * ((odo - PrevODO) / (NextOdo - PrevODO)) + PrevLat]
}
# odo Lat NextLat PrevLat PrevODO NextOdo
# 1: 2.62 30.0111586800 30.01115868 NA NA NA
# 2: 5.19 30.0111640700 30.01116407 30.0111586800 NA NA
# 3: 7.61 30.0111691900 30.01116919 30.0111640700 NA NA
# 4: 18.82 30.0111923247 30.01119282 30.0111691900 7.61 19.06
# 5: 19.06 30.0111928200 30.01119282 30.0111923247 NA NA
# 6: 19.35 30.0111933900 30.01119339 30.0111928200 NA NA
# 7: 20.54 30.0111940906 30.01122998 30.0111933900 19.35 81.50
# 8: 20.81 30.0111942496 30.01122998 30.0111940906 20.54 81.50
# 9: 37.38 30.0112040049 30.01122998 30.0111942496 20.81 81.50
# 10: 81.50 30.0113223800 30.01122998 NA NA NA
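For completeness, here is one possible shape for the escape route mentioned in the side note. It is a sketch, not what I benchmarked. The wrapper fill_lat and the toy table bad are made-up names for illustration. The idea is to stop as soon as a pass fills no new value, because at that point the remaining NAs can never be filled.

```r
library(data.table)

# Hypothetical wrapper (illustrative name): the same repeat-loop,
# plus a guard that stops when a pass makes no progress.
fill_lat <- function(dt) {
  dt <- copy(dt)  # work on a copy so the caller's table is not modified
  repeat {
    n_missing <- sum(is.na(dt$Lat))
    if (n_missing == 0) break
    dt[, PrevLat := shift(Lat, 1, type = "lag")]
    dt[is.na(Lat) & !is.na(PrevLat),
       Lat := (NextLat - PrevLat) * ((odo - PrevODO) / (NextOdo - PrevODO)) + PrevLat]
    # guard: nothing was filled in this pass, so looping again cannot help
    if (sum(is.na(dt$Lat)) == n_missing) {
      warning("some Lat values could not be filled")
      break
    }
  }
  dt
}

# Toy table (made up for this example) whose first Lat is NA
bad <- data.table(
  odo     = c(1, 2, 3),
  Lat     = c(NA, 30.1, 30.2),
  NextLat = c(30.1, 30.1, 30.2),
  PrevODO = c(NA_real_, NA_real_, NA_real_),
  NextOdo = c(2, NA, NA)
)
res <- suppressWarnings(fill_lat(bad))  # returns instead of looping forever
```

The guard costs one extra sum(is.na()) per pass. It also catches rows where PrevODO or NextOdo is missing, since those evaluate to NA and never count as progress.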
Benchmarks
On a data.table of 10M rows (your atable repeated 1M times):
On my system (a roughly 6-year-old i5 with 16 GB of memory), the loop takes about a second to calculate a value for each Lat.
dt <- atable[rep(atable[, .I], 1000000)]
system.time(
  repeat {
    # until there are no more NA in Lat
    if (sum(is.na(dt$Lat)) == 0) {
      break
    }
    # (re)calculate PrevLat
    dt[, PrevLat := shift(Lat, 1, type = "lag")]
    # calculate Lat when PrevLat is known
    dt[is.na(Lat) & !is.na(PrevLat),
       Lat := (NextLat - PrevLat) * ((odo - PrevODO) / (NextOdo - PrevODO)) + PrevLat]
  }
)
# user system elapsed
# 0.90 0.35 1.08
sessioninfo
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)
other attached packages: [1] data.table_1.12.4
Update: code explanation
What the code does:
1. Fill PrevLat with the Lat value from the previous row.
2. Find the rows where Lat is NA and where PrevLat has a value (i.e. is not NA).
3. Calculate Lat for those rows, based on the formula you provided.
4. Repeat steps 1 to 3 until the sum of the check is.na(atable$Lat) equals 0. When this condition is met, there are no more NA values in the Lat column, so we can exit the repeat loop using break.
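To make the repeated passes concrete, here is a small sketch (same data and formula as in the question; the remaining vector is only an illustrative name) that records how many NA values are left after each pass. Each pass fills only the first NA of every run, so the number of passes equals the length of the longest run of consecutive NAs.

```r
library(data.table)

# Data from the question (PrevLat is created inside the loop)
dt <- data.table(
  odo     = c(2.62, 5.19, 7.61, 18.82, 19.06, 19.35, 20.54, 20.81, 37.38, 81.5),
  Lat     = c(30.01115868, 30.01116407, 30.01116919, NA, 30.01119282, 30.01119339, NA, NA, NA, 30.01132238),
  NextLat = c(30.01115868, 30.01116407, 30.01116919, 30.01119282, 30.01119282, 30.01119339,
              30.01122998, 30.01122998, 30.01122998, 30.01122998),
  PrevODO = c(NA, NA, NA, 7.61, NA, NA, 19.35, 20.54, 20.81, NA),
  NextOdo = c(NA, NA, NA, 19.06, NA, NA, 81.5, 81.5, 81.5, NA)
)

remaining <- integer(0)  # number of NA values left after each pass
while (anyNA(dt$Lat)) {
  dt[, PrevLat := shift(Lat, 1, type = "lag")]
  dt[is.na(Lat) & !is.na(PrevLat),
     Lat := (NextLat - PrevLat) * ((odo - PrevODO) / (NextOdo - PrevODO)) + PrevLat]
  remaining <- c(remaining, sum(is.na(dt$Lat)))
}
remaining  # 2 1 0: three passes, matching the run of three NAs (rows 7 to 9)
```

So the loop is cheap when NA runs are short, as in the benchmark data. A table with very long runs would need correspondingly more passes.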
Upvotes: 5
Reputation: 11255
Here's a very explicit loop in data.table within {}:
library(data.table)
atable <- data.table(
  odo     = c(2.62, 5.19, 7.61, 18.82, 19.06, 19.35, 20.54, 20.81, 37.38, 81.5),
  Lat     = c(30.01115868, 30.01116407, 30.01116919, NA, 30.01119282, 30.01119339, NA, NA, NA, 30.01132238),
  NextLat = c(30.01115868, 30.01116407, 30.01116919, 30.01119282, 30.01119282, 30.01119339,
              30.01122998, 30.01122998, 30.01122998, 30.01122998),
  PrevLat = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
  PrevODO = c(NA, NA, NA, 7.61, NA, NA, 19.35, 20.54, 20.81, NA),
  NextOdo = c(NA, NA, NA, 19.06, NA, NA, 81.5, 81.5, 81.5, NA)
)
options('digits' = 10)
# na_rleid labels runs of consecutive rows; LagLat is the previous row's NextLat
atable[, c('na_rleid', 'LagLat') := .(rleid(is.na(PrevODO)), shift(NextLat))]
atable[!is.na(PrevODO),
       Lat := {
         x = vector('numeric', .N)
         const = (odo - PrevODO) / (NextOdo - PrevODO)
         # the first row of a run starts from the last known value
         x[1] = (NextLat[1] - LagLat[1]) * const[1] + LagLat[1]
         # later rows build on the value just calculated
         for (i in seq_len(.N)[-1]) {
           x[i] = (NextLat[i] - x[i - 1]) * const[i] + x[i - 1]
         }
         x
       },
       by = na_rleid]
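The inner for loop implements the recurrence x[i] = (N - x[i-1]) * const[i] + x[i-1], which can be rewritten as N - x[i] = (N - x[i-1]) * (1 - const[i]). Assuming NextLat (N) is constant within a run of NA rows, as it is in the sample data, this unrolls to x[i] = N - (N - x0) * cumprod(1 - const)[i], where x0 is the last known Lat before the run. A sketch of that loop-free form in base R, with fill_run as a made-up name:

```r
# Hypothetical helper: fill one run of consecutive NA rows without a loop.
#   x0       last known Lat before the run
#   next_lat NextLat of the run (assumed constant within the run)
#   const    (odo - PrevODO) / (NextOdo - PrevODO) for each row of the run
fill_run <- function(x0, next_lat, const) {
  next_lat - (next_lat - x0) * cumprod(1 - const)
}

# The run made up of rows 20.54, 20.81 and 37.38 from the question
odo      <- c(20.54, 20.81, 37.38)
prev_odo <- c(19.35, 20.54, 20.81)
next_odo <- 81.5
const    <- (odo - prev_odo) / (next_odo - prev_odo)

run <- fill_run(30.01119339, 30.01122998, const)
print(run, digits = 12)
```

Inside the grouped call above, the body of the braces could then be reduced to fill_run(LagLat[1], NextLat[1], const). If NextLat can change within a run, the explicit loop is still needed.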
Upvotes: 1
Reputation: 1502
I'll be happy to be corrected by an R guru, but I've not really seen easy ways to accumulate values forward without looping as you're doing.
But I guess if you install Rcpp and any associated paraphernalia, you could do something like this:
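src <-
"NumericVector fill_lat_na(NumericMatrix v){
  // columns by position (0-based): 0 odo, 1 Lat, 2 NextLat, 3 PrevLat, 4 PrevODO, 5 NextOdo
  NumericVector ret(v.nrow());
  for(int i = 0; i < v.nrow(); ++i){
    ret[i] = v(i, 1);
    if(NumericVector::is_na(ret[i]))
    {
      // builds on ret[i-1], the value already stored for the previous row
      ret[i] = (v(i, 2) - ret[i-1]) * ((v(i, 0) - v(i, 4)) / (v(i, 5) - v(i, 4))) + ret[i-1];
    }
  }
  return(ret);
}
"
Rcpp::cppFunction(src)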
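This will give you a function fill_lat_na(), which you can then invoke in R fashion (dfmat is presumably your table, with its columns in the same order as atable, since the function addresses columns by position):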
lat <- fill_lat_na(as.matrix(dfmat))
Note there is no lower bound check here, so e.g., if your first row has an NA in its Lat, this will fail. The function can perhaps be improved to reference named columns as well.
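As a rough cross-check for the compiled function, the same single forward pass can be written in plain R with named columns and a first-row guard. This is a sketch with a made-up name (fill_lat_na_r). It will be slower than the Rcpp version on large inputs, but it is handy for validating results on a small sample.

```r
# Hypothetical plain-R counterpart of fill_lat_na(): one forward pass,
# columns addressed by name, with a guard for a missing first value.
fill_lat_na_r <- function(df) {
  lat <- df$Lat
  if (length(lat) > 0 && is.na(lat[1])) {
    stop("the first row has no Lat, so there is nothing to build on")
  }
  frac <- (df$odo - df$PrevODO) / (df$NextOdo - df$PrevODO)
  # which() returns ascending indices, so lat[i - 1] is already filled
  for (i in which(is.na(lat))) {
    lat[i] <- (df$NextLat[i] - lat[i - 1]) * frac[i] + lat[i - 1]
  }
  lat
}

# Data from the question
df <- data.frame(
  odo     = c(2.62, 5.19, 7.61, 18.82, 19.06, 19.35, 20.54, 20.81, 37.38, 81.5),
  Lat     = c(30.01115868, 30.01116407, 30.01116919, NA, 30.01119282, 30.01119339, NA, NA, NA, 30.01132238),
  NextLat = c(30.01115868, 30.01116407, 30.01116919, 30.01119282, 30.01119282, 30.01119339,
              30.01122998, 30.01122998, 30.01122998, 30.01122998),
  PrevODO = c(NA, NA, NA, 7.61, NA, NA, 19.35, 20.54, 20.81, NA),
  NextOdo = c(NA, NA, NA, 19.06, NA, NA, 81.5, 81.5, 81.5, NA)
)
lat <- fill_lat_na_r(df)
print(lat, digits = 12)
```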
Upvotes: 2