Reputation: 4298
I have a big table: 10M rows by 33 columns, of which 28 columns have some NA values. These NA values need to be patched using locf()
. I read a few threads (efficiently locf by groups in a single R data.table and na.locf and inverse.rle in Rcpp) on this topic. However, these threads are about replacing numeric vectors. I am not too familiar with Rcpp
so I don't know how to change their code to cater to strings---my data are all strings.
Here are my sample data:
Input Data
Sample_File = structure(list(SO = c(112, 112, 112, 112, 113, 113, 113, 113),
Product.ID = c("AB123", "CD234", "DE345", "EF456", "FG456",
"GH567", "HI678", "IJ789"), Name = c(NA, NA, NA, "Human Being",
NA, "Lion", NA, "Bird"), Family = c(NA, NA, NA, "Homo Sapiens",
NA, NA, NA, "Passeridae"), SL1_Continent = c("Asia", NA,
"Asia", "Asia", NA, NA, NA, "Australia"), SL2_Country = c("China",
"China", NA, NA, NA, NA, NA, "Australia"), SL3_Direction = c("East",
NA, "East", "East", NA, NA, NA, "West"), Expiration_FY = c(2021,
NA, 2018, NA, 2012, 2012, NA, 2012), Flag = c("Y", NA, "N",
"N", NA, NA, NA, "TBD"), Insured = c("No", NA, NA, NA, NA,
NA, NA, "Yes"), Revenue = c(0, 478227.44, 0, 0, 0, 0, 125550.4,
44314.51), Quantity = c(1000, 100, 100, 4, 6, 6, 4, 6)), .Names = c("SO",
"Product.ID", "Name", "Family", "SL1_Continent", "SL2_Country",
"SL3_Direction", "Expiration_FY", "Flag", "Insured", "Revenue",
"Quantity"), row.names = c(NA, 8L), class = "data.frame")
Here's my code using data.table
:
data.table::setDT(Sample_File)
cols <- c("Name","Family","SL1_Continent","SL2_Country","SL3_Direction","Expiration_FY","Flag","Insured")
Sample_File[, (cols):=lapply(.SD, function(x){na.locf(x,fromLast = TRUE,na.rm=TRUE)}), by = SO, .SDcols = cols]
Expected Output:
Output = structure(list(SO = c(112, 112, 112, 112, 113, 113, 113, 113),
Product.ID = c("AB123", "CD234", "DE345", "EF456", "FG456",
"GH567", "HI678", "IJ789"), Name = c("Human Being", "Human Being",
"Human Being", "Human Being", "Lion", "Lion", "Bird", "Bird"
), Family = c("Homo Sapiens", "Homo Sapiens", "Homo Sapiens",
"Homo Sapiens", "Passeridae", "Passeridae", "Passeridae",
"Passeridae"), SL1_Continent = c("Asia", "Asia", "Asia",
"Asia", "Australia", "Australia", "Australia", "Australia"
), SL2_Country = c("China", "China", "China", "China", "Australia",
"Australia", "Australia", "Australia"), SL3_Direction = c("East",
"East", "East", "East", "West", "West", "West", "West"),
Expiration_FY = c(2021, 2018, 2018, 2021, 2012, 2012, 2012,
2012), Flag = c("Y", "N", "N", "N", "TBD", "TBD", "TBD",
"TBD"), Insured = c("No", "No", "No", "No", "Yes", "Yes",
"Yes", "Yes"), Revenue = c(0, 478227.44, 0, 0, 0, 0, 125550.4,
44314.51), Quantity = c(1000, 100, 100, 4, 6, 6, 4, 6)), .Names = c("SO",
"Product.ID", "Name", "Family", "SL1_Continent", "SL2_Country",
"SL3_Direction", "Expiration_FY", "Flag", "Insured", "Revenue",
"Quantity"), row.names = c(NA, -8L), class = "data.frame")
While the above code takes fraction of second to execute, it takes ~10 minutes to process one column in my original data-set, which translates to ~280 minutes to process 28 columns even with data.table
.
I am assuming that I am not really utilizing the power of data.table
above. I am not really sure. I'd sincerely appreciate any help to speed up na.locf()
function.
Is there any more efficient method to replace NA
above?
Upvotes: 1
Views: 415
Reputation: 17642
I simplified the problem for the purpose of this example, but I guess it is easy enough to generalize. The code below defines the locppf
function in Rcpp using C++11 syntax:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::plugins(cpp11)]]
using Map = std::unordered_map<double, int> ;
using Pair = Map::value_type ;
// [[Rcpp::export]]
CharacterVector locppf(NumericVector g, CharacterVector s) {
auto n = g.size() ;
CharacterVector out = clone(s) ;
Map map ;
for(int i=n-1; i>=0; i--){
double value = g[i] ;
auto it = map.find( value ) ;
if( it == map.end() ){
map.insert( Pair(value, i) ) ;
} else {
// if the current value is NA, replace it with the data at correct idx
auto current = s[i] ;
if( CharacterVector::is_na( current ) ){
out[i] = s[ it->second ] ;
} else {
it->second = i ;
}
}
}
return out ;
}
The idea is to define a map to track the index of the last time we've seen something that was not NA
in the group.I'm using std::unordered_map<double, int>
as the map because your example also used a numeric vector for the group.
Let's break the relevant nuggets:
if( it == map.end() ){
map.insert( Pair(value, i) ) ;
}
Here we check if the map has already seen the current value, and if not we retain the current index.
auto current = s[i] ;
if( CharacterVector::is_na( current ) ){
out[i] = s[ it->second ] ;
} else {
it->second = i ;
}
Here we check if the current value is NA with CharacterVector::is_na
.
If it is we fill the result vector with the value that is in the index we retained earlier.
If not, we change the index that is remembered by the map for this group.
Now let's give ourselves some data:
library("zoo")
library("dplyr")
library("data.table")
with_holes <- function(x, p = .2){
n <- length(x)
x[ sample(n, n*p) ] <- NA
x
}
n <- 1e6
x <- sample( as.numeric(1:100), n, replace= TRUE )
y <- with_holes( sample( letters, n, replace = TRUE) )
d <- data_frame( x = x, y = y )
And measure timing with various options:
Using dplyr syntax with group_by
, mutate
and na.locf
> system.time( d %>% group_by(x) %>% mutate( y = na.locf(y, fromLast = TRUE, na.rm = FALSE) ) )
user system elapsed
0.173 0.023 0.198
Using data.table
syntax with na.locf
. I don't guarantee this is the best data.table
way to do this though.
> d2 <- as.data.table(d)
> system.time( d2[ , y := na.locf(y, fromLast = TRUE, na.rm = FALSE) , x ] )
user system elapsed
0.159 0.030 0.188
With out custom locppf
function:
> system.time( locppf(d$x, d$y) )
user system elapsed
0.028 0.001 0.028
Upvotes: 6