Fast mapping between two data.table datasets

Question

I want to get county names associated with all zip codes in a dataset. I'm getting relatively fast results with a data frame (although I feel it can be done much faster) but not so with a data.table, even with some optimization. Is there a way to further speed it up using data frames or data.tables?

Here is my initialization (based on this answer):

library(noncensus)
data(zip_codes)
data(counties)
counties$fips <- as.numeric(paste0(counties$state_fips, counties$county_fips))

Computation with data frames (second is slightly faster, as expected) - 20, 16 seconds:

system.time(sapply(zip_codes$fips, function(x) subset(counties, fips == x)$county_name))
system.time(sapply(zip_codes$fips, function(x) counties[counties$fips==x,]$county_name))

Computation with data tables - 60, 43 seconds:

zip_codes.dt <- data.table(zip_codes)
counties.dt <- data.table(zip_codes)
system.time(sapply(zip_codes.dt$fips, function(x) subset(counties.dt, fips == x)$county_name))
setkey(counties.dt, fips)  # optimizing
system.time(sapply(zip_codes.dt$fips, function(x) counties.dt[.(x)]$county_name))

Naumz · Accepted Answer

Based on @Frank's suggestion, reading the vignettes here and the package documentation helped me find the answer using data.table.

Here it is:

zip_codes.dt[counties.dt, on="fips", county_name := county_name]

Fast mapping between two data.table datasets

Answers (1)

Related Questions