Reputation: 481
I want to get county names associated with all zip codes in a dataset. I'm getting relatively fast results with a data frame (although I feel it can be done much faster) but not so with a data.table, even with some optimization. Is there a way to further speed it up using data frames or data.tables?
Here is my initialization (based on this answer):
library(noncensus)
data(zip_codes)
data(counties)
counties$fips <- as.numeric(paste0(counties$state_fips, counties$county_fips))
Computation with data frames (second is slightly faster, as expected) - 20, 16 seconds:
system.time(sapply(zip_codes$fips, function(x) subset(counties, fips == x)$county_name))
system.time(sapply(zip_codes$fips, function(x) counties[counties$fips==x,]$county_name))
Computation with data tables - 60, 43 seconds:
zip_codes.dt <- data.table(zip_codes)
counties.dt <- data.table(zip_codes)
system.time(sapply(zip_codes.dt$fips, function(x) subset(counties.dt, fips == x)$county_name))
setkey(counties.dt, fips) # optimizing
system.time(sapply(zip_codes.dt$fips, function(x) counties.dt[.(x)]$county_name))
Upvotes: 0
Views: 179