Reputation: 2770
I am looking for an efficient way to replace missing data in a data.table.
When 'description' is "", I want to replace it with the 'description' from the row that has the latest date among the rows with the same id. So for seqs=3, I want to replace "" with "Red" because seqs=2 is not "" and has the latest date for that id. Seqs=4 would just remain "" because no other row shares its id.
I am working with several million rows and am using data.table (so I would love to not convert to the tidyverse).
set.seed(3)
dt <- data.frame(
  seqs = 1:11,
  id = c(1, 1, 1, 2, 3, 3, 4, 4, 4, 4, 5),
  description = c("Red", "Red", "", "", "Blue", "Blue", "Red", "Red", "", "Red", "Blue"),
  dates = sample(2000:2020, 11, TRUE)
)
Upvotes: 0
Views: 37
Reputation: 19191
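One approach using last(): order the rows by id and dates, then, within each id, replace every empty description with the description of the latest-dated row.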
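library(data.table)
setDT(dt)  # convert the data.frame to a data.table by reference
# Within each id, rows are visited in date order, so last() is the latest date
dt[order(id, dates),
   description := ifelse(description == "", last(description), description), by = id]
dt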
    seqs id description dates
 1:    1  1         Red  2004
 2:    2  1         Red  2011
 3:    3  1         Red  2006
 4:    4  2              2003
 5:    5  3        Blue  2007
 6:    6  3        Blue  2010
 7:    7  4         Red  2007
 8:    8  4         Red  2019
 9:    9  4         Red  2009
10:   10  4         Red  2007
11:   11  5        Blue  2015
dt <- structure(list(seqs = 1:11, id = c(1, 1, 1, 2, 3, 3, 4, 4, 4,
4, 5), description = c("Red", "Red", "Red", "", "Blue", "Blue",
"Red", "Red", "Red", "Red", "Blue"), dates = c(2004L, 2011L,
2006L, 2003L, 2007L, 2010L, 2007L, 2019L, 2009L, 2007L, 2015L
)), class = "data.frame", row.names = c(NA, -11L))
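One caveat with the approach above: last(description) is taken from the latest-dated row whether or not that row is itself "", so a group whose most recent row is empty would keep its blanks. Below is a minimal sketch of a variant that picks the latest non-empty description instead. The helper name latest_nonempty is made up for illustration, the dates are the fixed values from the data above rather than a fresh sample(), and fifelse() is data.table's faster counterpart to ifelse() (assuming a reasonably recent data.table version). It has not been benchmarked on millions of rows.

```r
library(data.table)

# Example data: descriptions from the question, fixed dates from the answer's
# data, so the result does not depend on the random seed.
dt <- data.table(
  seqs = 1:11,
  id = c(1, 1, 1, 2, 3, 3, 4, 4, 4, 4, 5),
  description = c("Red", "Red", "", "", "Blue", "Blue", "Red", "Red", "", "Red", "Blue"),
  dates = c(2004L, 2011L, 2006L, 2003L, 2007L, 2010L, 2007L, 2019L, 2009L, 2007L, 2015L)
)

# Hypothetical helper (not part of data.table): returns the description of the
# latest-dated non-empty row, or "" when every row in the group is empty.
latest_nonempty <- function(description, dates) {
  filled <- description != ""
  if (!any(filled)) return("")
  description[filled][which.max(dates[filled])]
}

# Fill the blanks group by group; non-empty descriptions are left untouched.
dt[, description := fifelse(description == "",
                            latest_nonempty(description, dates),
                            description),
   by = id]
```

With this data, seqs 3 and 9 become "Red" and seqs 4 stays "", matching the output shown above. If two non-empty rows tie on the latest date, which.max() picks the first of them.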
Upvotes: 2