Peter Chen

Reputation: 1484

R impute with Kalman on large data

I have a large dataset, 4666972 obs. of 5 variables.
I want to impute one column, MPR, with the Kalman method within each group.

> str(dt)
Classes ‘data.table’ and 'data.frame':  4666972 obs. of  5 variables:
 $ Year : int  1999 2000 2001 1999 2000 2001 1999 2000 2001 1999 ...
 $ State: int  1 1 1 1 1 1 1 1 1 1 ...
 $ CC   : int  1 1 1 1 1 1 1 1 1 1 ...
 $ ID   : chr  "1" "1" "1" "2" ...
 $ MPR  : num  54 54 55 52 52 53 60 60 65 70 ...

I tried the code below but it crashed after a while.

> library(imputeTS)
> data.table::setDT(dt)[, MPR_kalman := with(dt, ave(MPR, State, CC, ID, FUN=na_kalman))]

I don't know how to improve the time efficiency and impute successfully without crashing.

Is it better to split the dataset by ID into a list and impute each piece with a for loop?

> length(unique(hpms_S3$Section_ID))
[1] 668184

> split(dt, dt$ID)

However, I don't think this will save much memory or avoid the crash, since splitting the dataset gives 668184 list elements, I have to impute each of them, and then combine everything back into one dataset at the end.
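Just to show what I mean, here is a rough sketch of that split-and-loop idea (only a sketch, not something I have run on the full data; it uses data.table's split() and rbindlist() with the grouping columns above):

library(data.table)
library(imputeTS)

# Sketch: split by group, impute each piece, then stack everything back
# together with rbindlist().
dt_list <- split(dt, by = c("State", "CC", "ID"))
dt_list <- lapply(dt_list, function(d) {
  d[, MPR_kalman := na_kalman(MPR)]  # errors for groups with too few non-NA values
  d
})
dt_imputed <- rbindlist(dt_list)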

Is there a better way to do this, or how can I optimize the code I have?

I provide the simple sample here:

# dt
Year  State   CC   ID    MPR    
2002     15   3     3     NA  
2003     15   3     3     NA  
2004     15   3     3    193   
2005     15   3     3    193  
2006     15   3     3    348  
2007     15   3     3    388  
2008     15   3     3    388  
1999     53   33    1     NA  
2000     53   33    1     NA       
2002     53   33    1     NA      
2003     53   33    1     NA   
2004     53   33    1     NA     
2005     53   33    1    170  
2006     53   33    1    170        
2007     53   33    1    330      
2008     53   33    1    330          

EDIT:
As @r2evans mentioned in a comment, I modified the code:

> setDT(dt)[, MPR_kalman := ave(MPR, State, CC, ID, FUN=na_kalman), by = .(State, CC, ID)]

Error in optim(init[mask], getLike, method = "L-BFGS-B", lower = rep(0,  : 
  L-BFGS-B needs finite values of 'fn'

I got the error above. I found the post here discussing this error. However, even when I use na_kalman(MPR, type = 'level'), I still get the error. I think there might be some repeated values within groups that produce the error.
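One workaround I am considering (just a sketch; safe_impute is a name I made up, and the fallback to na_locf is an assumption rather than the imputation I actually want) is to catch the error per group and fall back to a simpler fill:

library(imputeTS)

# Sketch: try Kalman imputation per group, and fall back to
# last-observation-carried-forward (filling leading NAs backwards)
# when na_kalman() fails, e.g. because the observed values in a group
# are all identical or too sparse.
safe_impute <- function(x) {
  tryCatch(na_kalman(x),
           error = function(e) na_locf(x, na_remaining = "rev"))
}

setDT(dt)[, MPR_kalman := safe_impute(MPR), by = .(State, CC, ID)]

Groups where MPR is entirely NA would still need separate handling.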

Upvotes: 0

Views: 227

Answers (1)

r2evans

Reputation: 160687

Perhaps the splitting should be done using data.table's by= operator, which is likely more efficient.

Since I don't have imputeTS installed (there are several nested dependencies I don't have), I'll fake imputation using zoo::na.locf, both forwards and backwards. I'm not suggesting this as your imputation mechanism; I'm using it to demonstrate a more common pattern with data.table.

myimpute <- function(z) zoo::na.locf(zoo::na.locf(z, na.rm = FALSE), fromLast = TRUE, na.rm = FALSE)

Here are some equivalent calls: one with your with(dt, ...) and then my alternatives (which are really a walk-through leading up to my ultimate suggestion, number 5):

dt[, MPR_kalman1 := with(dt, ave(MPR, State, CC, ID, FUN = myimpute))]
dt[, MPR_kalman2 := with(.SD, ave(MPR, State, CC, ID, FUN = myimpute))]
dt[, MPR_kalman3 := with(.SD, ave(MPR, FUN = myimpute)), by = .(State, CC, ID)]
dt[, MPR_kalman4 := ave(MPR, FUN = myimpute), by = .(State, CC, ID)]
dt[, MPR_kalman5 := myimpute(MPR), by = .(State, CC, ID)]
#     Year State CC ID MPR MPR_kalman1 MPR_kalman2 MPR_kalman3 MPR_kalman4 MPR_kalman5
#  1: 2002    15  3  3  NA         193         193         193         193         193
#  2: 2003    15  3  3  NA         193         193         193         193         193
#  3: 2004    15  3  3 193         193         193         193         193         193
#  4: 2005    15  3  3 193         193         193         193         193         193
#  5: 2006    15  3  3 348         348         348         348         348         348
#  6: 2007    15  3  3 388         388         388         388         388         388
#  7: 2008    15  3  3 388         388         388         388         388         388
#  8: 1999    53 33  1  NA         170         170         170         170         170
#  9: 2000    53 33  1  NA         170         170         170         170         170
# 10: 2002    53 33  1  NA         170         170         170         170         170
# 11: 2003    53 33  1  NA         170         170         170         170         170
# 12: 2004    53 33  1  NA         170         170         170         170         170
# 13: 2005    53 33  1 170         170         170         170         170         170
# 14: 2006    53 33  1 170         170         170         170         170         170
# 15: 2007    53 33  1 330         330         330         330         330         330
# 16: 2008    53 33  1 330         330         330         330         330         330

All five variants produce the same results, but the later ones preserve many of the memory efficiencies that can make data.table preferred.

The use of with(dt, ...) is an anti-pattern in one case, and a strong risk in another. For the "risk" part, realize that data.table can do a lot of things behind the scenes so that the calculations/function calls within the j= component (second argument) only see data that is relevant. A clear example is grouping, but another (unrelated to this) data.table example is conditional replacement, as in dt[is.na(x), x := -1]. If you reference the entire table dt inside j=, then as soon as there is something in the first argument (a conditional replacement) or a by= argument, it breaks.
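To illustrate the risk (a contrived example with a hypothetical MPR_filled column, not something you would actually run):

# with(dt, ...) always sees every row of dt, so when i= selects a subset the
# right-hand side no longer lines up with the rows being assigned:
dt[is.na(MPR), MPR_filled := with(dt, ave(MPR, State, CC, ID, FUN = myimpute))]
# the RHS has nrow(dt) values but only the is.na(MPR) rows are assigned,
# so this errors (or recycles in a way you did not intend)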

MPR_kalman2 mitigates this by using .SD, which is data.table's way of replacing the data-to-be-used with the "Subset of the Data" (ref). But it's still not taking advantage of data.table's significant efficiencies in dealing in-memory with groups.

MPR_kalman3 improves on this by moving the grouping out to by=, still using with but in a friendlier way than 2.

MPR_kalman4 removes the use of with, since really the MPR visible to ave is only within each group anyway. And then when you think about it, since ave is given no grouping variables, it really just passes all of the MPR data straight-through to myimpute. From this, we have MPR_kalman5, a direct method that is along the normal patterns of data.table.

While I don't know that it will mitigate your crashing, it is intended very much to be memory-efficient (in data.table's ways).
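If imputeTS does install for you, the same direct pattern should carry over with na_kalman in place of myimpute (a sketch; the tryCatch fallback to the group's original values is my addition, so that one problematic group does not abort the whole run):

library(imputeTS)

# Same shape as MPR_kalman5, but with the real imputation function.
# If na_kalman() fails for a group (e.g. too few non-NA points),
# keep that group's original values instead of stopping.
dt[, MPR_kalman := tryCatch(na_kalman(MPR), error = function(e) MPR),
   by = .(State, CC, ID)]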

Upvotes: 2
