Reputation: 1643
I'm basically looking to extract FIPS from ZIP codes but they are stored as a single string: "00001_00001"
which needs to be parsed and matched then the most frequent match finally extracted.
Here's some working code I've come up with but I have a very large dataset so I need to optimize.
library(tidyverse)
test <-
data.frame(
zips = c("00001", "00002", "00003", "00004"),
fips = c("1", "1", "2", "2")
)
sample <-
data.frame(
unit = c(1, 2, 3, 4, 5),
zips = c(
"00001_00002",
"00001_00002_00002",
"00001_00002_00003_00003",
"00002",
"00001_00003_00004"
)
)
dat <- sample %>%
group_by(unit) %>%
mutate(zips = str_split(zips, "_")) %>%
unnest(zips) %>%
left_join(test) %>%
count(unit, fips) %>%
slice(which.max(n))
Upvotes: 0
Views: 64
Reputation: 8880
try using separate_rows
sample %>%
separate_rows(zips, sep = "_") %>%
left_join(test) %>%
count(unit, fips) %>%
group_by(unit) %>%
slice(which.max(n))
Upvotes: 0
Reputation: 1780
To optimize your code you can rewrite it with data.table:
library(data.table)
# Create data.tables
sample_dt <- data.table(sample, key = "unit")
test_dt <- data.table(test)
out <- sample_dt[, list(zips = unlist(strsplit(zips, "_"))), by = unit][
test_dt, on = "zips"][
, .(count = .N), by = .(unit,fips)][
, .SD[which.max(count)], by = unit]
out
Upvotes: 2