Reputation: 1484
I have a large dataset, 4666972 obs. of 5 variables.
I want to impute one column, MPR
, with Kalman method based on each groups.
> str(dt)
Classes ‘data.table’ and 'data.frame': 4666972 obs. of 5 variables:
$ Year : int 1999 2000 2001 1999 2000 2001 1999 2000 2001 1999 ...
$ State: int 1 1 1 1 1 1 1 1 1 1 ...
$ CC : int 1 1 1 1 1 1 1 1 1 1 ...
$ ID : chr "1" "1" "1" "2" ...
$ MPR : num 54 54 55 52 52 53 60 60 65 70 ...
I tried the code below but it crashed after a while.
> library(imputeTS)
> data.table::setDT(dt)[, MPR_kalman := with(dt, ave(MPR, State, CC, ID, FUN=na_kalman))]
I don't know how to improve the time efficiency and impute successfully without crashed.
Is it better to split the dataset with ID
to list and impute each of them with for loop
?
> length(unique(hpms_S3$Section_ID))
[1] 668184
> split(dt, dt$ID)
However, I think this will not save too much of memory use or avoid crashed since when I split the dataset to 668184
lists and impute, I need to do multiple times and then combine to one dataset at last.
Is there any great way to do or how can I optimize code I did?
I provide the simple sample here:
# dt
Year State CC ID MPR
2002 15 3 3 NA
2003 15 3 3 NA
2004 15 3 3 193
2005 15 3 3 193
2006 15 3 3 348
2007 15 3 3 388
2008 15 3 3 388
1999 53 33 1 NA
2000 53 33 1 NA
2002 53 33 1 NA
2003 53 33 1 NA
2004 53 33 1 NA
2005 53 33 1 170
2006 53 33 1 170
2007 53 33 1 330
2008 53 33 1 330
EDIT:
As @r2evans mentioned in comment, I modified the code:
> setDT(dt)[, MPR_kalman := ave(MPR, State, CC, ID, FUN=na_kalman), by = .(State, CC, ID)]
Error in optim(init[mask], getLike, method = "L-BFGS-B", lower = rep(0, :
L-BFGS-B needs finite values of 'fn'
I got the error above. I found the post here for this error discussions. However, even I use na_kalman(MPR, type = 'level')
, I still got error. I think there might be some repeated values within groups so that it produced error.
Upvotes: 0
Views: 227
Reputation: 160687
Perhaps splitting should be done using data.table
's by=
operator, perhaps more efficient.
Since I don't have imputeTS
installed (there are several nested dependencies I don't have), I'll fake imputation using zoo::na.locf
, both forward/backwards. I'm not suggesting this be your imputation mechanism, I'm using it to demonstrate a more-common pattern with data.table
.
myimpute <- function(z) zoo::na.locf(zoo::na.locf(z, na.rm = FALSE), fromLast = TRUE, na.rm = FALSE)
Now some equivalent calls, one with your with(dt, ...)
and my alternatives (which are really walk-throughs until my ultimate suggestion of 5
):
dt[, MPR_kalman1 := with(dt, ave(MPR, State, CC, ID, FUN = myimpute))]
dt[, MPR_kalman2 := with(.SD, ave(MPR, State, CC, ID, FUN = myimpute))]
dt[, MPR_kalman3 := with(.SD, ave(MPR, FUN = myimpute)), by = .(State, CC, ID)]
dt[, MPR_kalman4 := ave(MPR, FUN = myimpute), by = .(State, CC, ID)]
dt[, MPR_kalman5 := myimpute(MPR), by = .(State, CC, ID)]
# Year State CC ID MPR MPR_kalman1 MPR_kalman2 MPR_kalman3 MPR_kalman4 MPR_kalman5
# 1: 2002 15 3 3 NA 193 193 193 193 193
# 2: 2003 15 3 3 NA 193 193 193 193 193
# 3: 2004 15 3 3 193 193 193 193 193 193
# 4: 2005 15 3 3 193 193 193 193 193 193
# 5: 2006 15 3 3 348 348 348 348 348 348
# 6: 2007 15 3 3 388 388 388 388 388 388
# 7: 2008 15 3 3 388 388 388 388 388 388
# 8: 1999 53 33 1 NA 170 170 170 170 170
# 9: 2000 53 33 1 NA 170 170 170 170 170
# 10: 2002 53 33 1 NA 170 170 170 170 170
# 11: 2003 53 33 1 NA 170 170 170 170 170
# 12: 2004 53 33 1 NA 170 170 170 170 170
# 13: 2005 53 33 1 170 170 170 170 170 170
# 14: 2006 53 33 1 170 170 170 170 170 170
# 15: 2007 53 33 1 330 330 330 330 330 330
# 16: 2008 53 33 1 330 330 330 330 330 330
The two methods produce the same results, but the latter preserves many of the memory-efficiencies that can make data.table
preferred.
The use of with(dt, ...)
is an anti-pattern in one case, and a strong risk in another. For the "risk" part, realize that data.table
can do a lot of things behind-the-scenes so that the calculations/function-calls within the j=
component (second argument) only sees data that is relevant. A clear example is grouping, but another (unrelated to this) data.table
example is conditional replacement, as in dt[is.na(x), x := -1]
. With the reference to the enter table dt
inside of this, if there is ever something in the first argument (conditional replacement) or a by=
argument, then it fails.
MPR_kalman2
mitigates this by using .SD
, which is data.table
's way of replacing the data-to-be-used with the "S
ubset of the D
ata" (ref). But it's still not taking advantage of data.table
's significant efficiencies in dealing in-memory with groups.
MPR_kalman3
works on this by grouping outside, still using with
but not (as in 2
) in a more friendly way.
MPR_kalman4
removes the use of with
, since really the MPR
visible to ave
is only within each group anyway. And then when you think about it, since ave
is given no grouping variables, it really just passes all of the MPR
data straight-through to myimpute
. From this, we have MPR_kalman5
, a direct method that is along the normal patterns of data.table
.
While I don't know that it will mitigate your crashing, it is intended very much to be memory-efficient (in data.table
's ways).
Upvotes: 2