SCDCE
SCDCE

Reputation: 1643

Extract most frequent matched value from a data.frame column list

I'm basically looking to extract FIPS from ZIP codes but they are stored as a single string: "00001_00001" which needs to be parsed and matched then the most frequent match finally extracted.

Here's some working code I've come up with but I have a very large dataset so I need to optimize.

library(tidyverse)
test <-
  data.frame(
    zips = c("00001", "00002", "00003", "00004"),
    fips = c("1", "1", "2", "2")
  )

sample <-
  data.frame(
    unit = c(1, 2, 3, 4, 5),
    zips = c(
      "00001_00002",
      "00001_00002_00002",
      "00001_00002_00003_00003",
      "00002",
      "00001_00003_00004"
    )
  )

dat <- sample %>%
  group_by(unit) %>%
  mutate(zips = str_split(zips, "_")) %>%
  unnest(zips) %>%
  left_join(test) %>%
  count(unit, fips) %>%
  slice(which.max(n))

Upvotes: 0

Views: 64

Answers (2)

Yuriy Saraykin
Yuriy Saraykin

Reputation: 8880

try using separate_rows

sample %>% 
  separate_rows(zips, sep = "_") %>% 
  left_join(test) %>%
  count(unit, fips) %>%
  group_by(unit) %>% 
  slice(which.max(n))

Upvotes: 0

mharinga
mharinga

Reputation: 1780

To optimize your code you can rewrite it with data.table:

library(data.table)

# Create data.tables
sample_dt <- data.table(sample, key = "unit")
test_dt <- data.table(test)

out <- sample_dt[, list(zips = unlist(strsplit(zips, "_"))), by = unit][
     test_dt, on = "zips"][
       , .(count = .N), by = .(unit,fips)][
         , .SD[which.max(count)], by = unit]
out

Upvotes: 2

Related Questions