Peter Chen

Reputation: 1484

R impute with Kalman on large data

I have a large dataset, 4666972 obs. of 5 variables.
I want to impute one column, MPR, with the Kalman method within each group.

> str(dt)
Classes ‘data.table’ and 'data.frame':  4666972 obs. of  5 variables:
 $ Year : int  1999 2000 2001 1999 2000 2001 1999 2000 2001 1999 ...
 $ State: int  1 1 1 1 1 1 1 1 1 1 ...
 $ CC   : int  1 1 1 1 1 1 1 1 1 1 ...
 $ ID   : chr  "1" "1" "1" "2" ...
 $ MPR  : num  54 54 55 52 52 53 60 60 65 70 ...

I tried the code below but it crashed after a while.

> library(imputeTS)
> data.table::setDT(dt)[, MPR_kalman := with(dt, ave(MPR, State, CC, ID, FUN=na_kalman))]

I don't know how to improve the time efficiency and impute successfully without crashing.

Is it better to split the dataset by ID into a list and impute each piece with a for loop?

> length(unique(hpms_S3$Section_ID))
[1] 668184

> split(dt, dt$ID)

However, I don't think this will save much memory or avoid the crash, since splitting the dataset gives 668184 list elements, I have to impute each of them, and then combine everything back into one dataset at the end.
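Just to show what I mean, here is a rough sketch of that split-and-loop idea (only a sketch, not something I have run on the full data; it uses data.table's split() and rbindlist() with the grouping columns above):

library(data.table)
library(imputeTS)

# Sketch: split by group, impute each piece, then stack everything back
# together with rbindlist().
dt_list <- split(dt, by = c("State", "CC", "ID"))
dt_list <- lapply(dt_list, function(d) {
  d[, MPR_kalman := na_kalman(MPR)]  # errors for groups with too few non-NA values
  d
})
dt_imputed <- rbindlist(dt_list)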

Is there a better way to do this, or how can I optimize the code I have?

I provide the simple sample here:

# dt
Year  State   CC   ID    MPR    
2002     15   3     3     NA  
2003     15   3     3     NA  
2004     15   3     3    193   
2005     15   3     3    193  
2006     15   3     3    348  
2007     15   3     3    388  
2008     15   3     3    388  
1999     53   33    1     NA  
2000     53   33    1     NA       
2002     53   33    1     NA      
2003     53   33    1     NA   
2004     53   33    1     NA     
2005     53   33    1    170  
2006     53   33    1    170        
2007     53   33    1    330      
2008     53   33    1    330          

EDIT:
As @r2evans mentioned in a comment, I modified the code:

> setDT(dt)[, MPR_kalman := ave(MPR, State, CC, ID, FUN=na_kalman), by = .(State, CC, ID)]

Error in optim(init[mask], getLike, method = "L-BFGS-B", lower = rep(0,  : 
  L-BFGS-B needs finite values of 'fn'

I got the error above. I found the post here discussing this error. However, even when I use na_kalman(MPR, type = 'level'), I still get the error. I think there might be some repeated values within groups that produce the error.
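One workaround I am considering (just a sketch; safe_impute is a name I made up, and the fallback to na_locf is an assumption rather than the imputation I actually want) is to catch the error per group and fall back to a simpler fill:

library(imputeTS)

# Sketch: try Kalman imputation per group, and fall back to
# last-observation-carried-forward (filling leading NAs backwards)
# when na_kalman() fails, e.g. because the observed values in a group
# are all identical or too sparse.
safe_impute <- function(x) {
  tryCatch(na_kalman(x),
           error = function(e) na_locf(x, na_remaining = "rev"))
}

setDT(dt)[, MPR_kalman := safe_impute(MPR), by = .(State, CC, ID)]

Groups where MPR is entirely NA would still need separate handling.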

Upvotes: 0

Views: 227

Answers (1)

r2evans

Reputation: 160687

Perhaps the splitting should be done using data.table's by= operator, which is likely more efficient.

Since I don't have imputeTS installed (there are several nested dependencies I don't have), I'll fake imputation using zoo::na.locf, both forwards and backwards. I'm not suggesting this as your imputation mechanism; I'm using it to demonstrate a more common pattern with data.table.

myimpute <- function(z) zoo::na.locf(zoo::na.locf(z, na.rm = FALSE), fromLast = TRUE, na.rm = FALSE)

Here are some equivalent calls: one with your with(dt, ...) and then my alternatives (which are really a walk-through leading up to my ultimate suggestion, number 5):

dt[, MPR_kalman1 := with(dt, ave(MPR, State, CC, ID, FUN = myimpute))]
dt[, MPR_kalman2 := with(.SD, ave(MPR, State, CC, ID, FUN = myimpute))]
dt[, MPR_kalman3 := with(.SD, ave(MPR, FUN = myimpute)), by = .(State, CC, ID)]
dt[, MPR_kalman4 := ave(MPR, FUN = myimpute), by = .(State, CC, ID)]
dt[, MPR_kalman5 := myimpute(MPR), by = .(State, CC, ID)]
#     Year State CC ID MPR MPR_kalman1 MPR_kalman2 MPR_kalman3 MPR_kalman4 MPR_kalman5
#  1: 2002    15  3  3  NA         193         193         193         193         193
#  2: 2003    15  3  3  NA         193         193         193         193         193
#  3: 2004    15  3  3 193         193         193         193         193         193
#  4: 2005    15  3  3 193         193         193         193         193         193
#  5: 2006    15  3  3 348         348         348         348         348         348
#  6: 2007    15  3  3 388         388         388         388         388         388
#  7: 2008    15  3  3 388         388         388         388         388         388
#  8: 1999    53 33  1  NA         170         170         170         170         170
#  9: 2000    53 33  1  NA         170         170         170         170         170
# 10: 2002    53 33  1  NA         170         170         170         170         170
# 11: 2003    53 33  1  NA         170         170         170         170         170
# 12: 2004    53 33  1  NA         170         170         170         170         170
# 13: 2005    53 33  1 170         170         170         170         170         170
# 14: 2006    53 33  1 170         170         170         170         170         170
# 15: 2007    53 33  1 330         330         330         330         330         330
# 16: 2008    53 33  1 330         330         330         330         330         330

All five variants produce the same results, but the later ones preserve many of the memory efficiencies that can make data.table preferred.

The use of with(dt, ...) is an anti-pattern in one case, and a strong risk in another. For the "risk" part, realize that data.table can do a lot of things behind the scenes so that the calculations/function calls within the j= component (second argument) only see data that is relevant. A clear example is grouping, but another (unrelated to this) data.table example is conditional replacement, as in dt[is.na(x), x := -1]. If you reference the entire table dt inside j=, then as soon as there is something in the first argument (a conditional replacement) or a by= argument, it breaks.
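To illustrate the risk (a contrived example with a hypothetical MPR_filled column, not something you would actually run):

# with(dt, ...) always sees every row of dt, so when i= selects a subset the
# right-hand side no longer lines up with the rows being assigned:
dt[is.na(MPR), MPR_filled := with(dt, ave(MPR, State, CC, ID, FUN = myimpute))]
# the RHS has nrow(dt) values but only the is.na(MPR) rows are assigned,
# so this errors (or recycles in a way you did not intend)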

MPR_kalman2 mitigates this by using .SD, which is data.table's way of replacing the data-to-be-used with the "Subset of the Data" (ref). But it's still not taking advantage of data.table's significant efficiencies in dealing in-memory with groups.

MPR_kalman3 improves on this by moving the grouping out to by=, still using with but in a friendlier way than 2.

MPR_kalman4 removes the use of with, since really the MPR visible to ave is only within each group anyway. And then when you think about it, since ave is given no grouping variables, it really just passes all of the MPR data straight-through to myimpute. From this, we have MPR_kalman5, a direct method that is along the normal patterns of data.table.

While I don't know that it will mitigate your crashing, it is intended very much to be memory-efficient (in data.table's ways).
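If imputeTS does install for you, the same direct pattern should carry over with na_kalman in place of myimpute (a sketch; the tryCatch fallback to the group's original values is my addition, so that one problematic group does not abort the whole run):

library(imputeTS)

# Same shape as MPR_kalman5, but with the real imputation function.
# If na_kalman() fails for a group (e.g. too few non-NA points),
# keep that group's original values instead of stopping.
dt[, MPR_kalman := tryCatch(na_kalman(MPR), error = function(e) MPR),
   by = .(State, CC, ID)]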

Upvotes: 2
