Speedy replacement of NA by strings using data.table or Rcpp in R

Question

I have a big table: 10M rows by 33 columns, of which 28 columns contain some NA values. These NA values need to be filled using na.locf() from the zoo package. I read a few threads on this topic (efficiently locf by groups in a single R data.table and na.locf and inverse.rle in Rcpp). However, these threads deal with numeric vectors. I am not too familiar with Rcpp, so I don't know how to adapt their code to handle strings, and my data are all strings.

Here are my sample data:

Input Data

Sample_File = structure(list(SO = c(112, 112, 112, 112, 113, 113, 113, 113), 
    Product.ID = c("AB123", "CD234", "DE345", "EF456", "FG456", 
    "GH567", "HI678", "IJ789"), Name = c(NA, NA, NA, "Human Being", 
    NA, "Lion", NA, "Bird"), Family = c(NA, NA, NA, "Homo Sapiens", 
    NA, NA, NA, "Passeridae"), SL1_Continent = c("Asia", NA, 
    "Asia", "Asia", NA, NA, NA, "Australia"), SL2_Country = c("China", 
    "China", NA, NA, NA, NA, NA, "Australia"), SL3_Direction = c("East", 
    NA, "East", "East", NA, NA, NA, "West"), Expiration_FY = c(2021, 
    NA, 2018, NA, 2012, 2012, NA, 2012), Flag = c("Y", NA, "N", 
    "N", NA, NA, NA, "TBD"), Insured = c("No", NA, NA, NA, NA, 
    NA, NA, "Yes"), Revenue = c(0, 478227.44, 0, 0, 0, 0, 125550.4, 
    44314.51), Quantity = c(1000, 100, 100, 4, 6, 6, 4, 6)), .Names = c("SO", 
"Product.ID", "Name", "Family", "SL1_Continent", "SL2_Country", 
"SL3_Direction", "Expiration_FY", "Flag", "Insured", "Revenue", 
"Quantity"), row.names = c(NA, 8L), class = "data.frame")

Here's my code using data.table:

library(zoo)  # provides na.locf()
data.table::setDT(Sample_File)
cols <- c("Name","Family","SL1_Continent","SL2_Country","SL3_Direction","Expiration_FY","Flag","Insured")
# Fill each column backward (fromLast = TRUE) within every SO group
Sample_File[, (cols) := lapply(.SD, function(x) na.locf(x, fromLast = TRUE, na.rm = TRUE)), by = SO, .SDcols = cols]

Expected Output:

Output = structure(list(SO = c(112, 112, 112, 112, 113, 113, 113, 113), 
    Product.ID = c("AB123", "CD234", "DE345", "EF456", "FG456", 
    "GH567", "HI678", "IJ789"), Name = c("Human Being", "Human Being", 
    "Human Being", "Human Being", "Lion", "Lion", "Bird", "Bird"
    ), Family = c("Homo Sapiens", "Homo Sapiens", "Homo Sapiens", 
    "Homo Sapiens", "Passeridae", "Passeridae", "Passeridae", 
    "Passeridae"), SL1_Continent = c("Asia", "Asia", "Asia", 
    "Asia", "Australia", "Australia", "Australia", "Australia"
    ), SL2_Country = c("China", "China", "China", "China", "Australia", 
    "Australia", "Australia", "Australia"), SL3_Direction = c("East", 
    "East", "East", "East", "West", "West", "West", "West"), 
    Expiration_FY = c(2021, 2018, 2018, 2021, 2012, 2012, 2012, 
    2012), Flag = c("Y", "N", "N", "N", "TBD", "TBD", "TBD", 
    "TBD"), Insured = c("No", "No", "No", "No", "Yes", "Yes", 
    "Yes", "Yes"), Revenue = c(0, 478227.44, 0, 0, 0, 0, 125550.4, 
    44314.51), Quantity = c(1000, 100, 100, 4, 6, 6, 4, 6)), .Names = c("SO", 
"Product.ID", "Name", "Family", "SL1_Continent", "SL2_Country", 
"SL3_Direction", "Expiration_FY", "Flag", "Insured", "Revenue", 
"Quantity"), row.names = c(NA, -8L), class = "data.frame")

While the above code takes a fraction of a second to execute on the sample, it takes ~10 minutes to process one column of my original dataset, which translates to ~280 minutes for all 28 columns, even with data.table.

I suspect I am not really utilizing the power of data.table above, but I am not sure. I'd sincerely appreciate any help speeding up the na.locf() step.

Is there a more efficient way to replace the NA values above?
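
One direction that might be worth testing is sketched below. It fills by index instead of calling na.locf() once per group, so each column is processed in a single vectorized pass. This is only a sketch under assumptions: fill_by_group is a hypothetical helper (it is not part of zoo or data.table), it assumes the rows of each SO are stored contiguously, and it has not been benchmarked on data of the size described above.

```r
# Hypothetical helper (not from zoo or data.table): carry the last non-NA
# value forward within each group. Assumes rows of a group are contiguous.
fill_by_group <- function(x, g, from_last = FALSE) {
  if (from_last) {
    # A backward fill is a forward fill on the reversed vectors
    return(rev(fill_by_group(rev(x), rev(g))))
  }
  # For every position, the index of the most recent non-NA value (0 if none)
  idx <- cummax(ifelse(is.na(x), 0L, seq_along(x)))
  idx[idx == 0L] <- NA
  # Discard values that would be carried across a group boundary
  idx[which(g[idx] != g)] <- NA
  x[idx]
}

so      <- c(112, 112, 112, 112, 113, 113, 113, 113)
insured <- c("No", NA, NA, NA, NA, NA, NA, "Yes")

# Backward fill first, then forward fill whatever is still NA
filled <- fill_by_group(fill_by_group(insured, so, from_last = TRUE), so)
print(filled)
# [1] "No"  "No"  "No"  "No"  "Yes" "Yes" "Yes" "Yes"
```

Because the helper works on character, numeric, or any other atomic vector, the same call could be applied to each of the columns in cols, with the SO column passed as g, without a by = SO step. Whether this is actually faster on 10M rows would need to be measured.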
