Kyle Ward
Kyle Ward

Reputation: 919

R How to split a column of strings into multiple columns using a format code/string?

I'm working with Census (CTPP) data, and the GEOID field is a long string that contains lots of geographic information. The format of this string changes for various Census tables, but they provide a code lookup. Here are a sample GEOID and format 'code'. (The parts I can already parse have been removed. This is the part of the GEOID I can't parse.)

geoid <- "0202000000126"
format <- "ssccczzzzzzzz"

This means that the first two characters ("02") signify the state (Alaska), the next three ("020") are the county, and the remaining characters are the zone.

I have a table of these geoid/format pairs, and the format can be different for each row.

df <- data.frame(
  geoid = c(
    "0224230",
    "0202000000126"
  ),
  format = c(
    "ssppppp",
    "ssccczzzzzzzz"
  )
)
# A tibble: 2 x 2
  geoid         format       
  <chr>         <chr>        
1 0224230       ssppppp      
2 0202000000126 ssccczzzzzzzz

What I'd like to do is break up the geoid column into columns for each geography like so:

# A tibble: 2 x 6
  geoid         format        s     p     c     z       
  <chr>         <chr>         <chr> <chr> <chr> <chr>   
1 0224230       ssppppp       02    24230 NA    NA      
2 0202000000126 ssccczzzzzzzz 02    NA    020   00000126

I've looked at several approaches. extract() from stringr looked promising. I'm also pretty sure I'll need a custom function that I mapply(?)/map over my data frame.

Upvotes: 2

Views: 195

Answers (3)

Henrik
Henrik

Reputation: 67778

A base alternative:

geo_codes <- c("s", "c", "p", "z")

# get starting position and lengths of consecutive characters in 'format'
g <- gregexpr("(.)\\1+", df$format)

# use the result above to extract corresponding substrings from 'geoid' 
geo <- regmatches(df$geoid, g)

# select first element in each run of 'format' and split
# used to name substrings from above
fmt <- strsplit(gsub("(.)\\1+", "\\1", df$format), "")

# for each element in 'geo' and 'fmt',
# 1. create a named vector
# 2. index the vector with 'geo_codes' 
# 3. set names of the full length vector
t(mapply(function(geo, fmt){
  setNames(setNames(geo, fmt)[geo_codes], geo_codes)},
  geo, fmt))
#      s    c     p       z         
# [1,] "02" NA    "24230" NA        
# [2,] "02" "020" NA      "00000126"

Another alternative,

geo <- strsplit(df$geoid, "")
fmt <- strsplit(df$format, "")

t(mapply(function(geo, fmt) unlist(lapply(split(geo, factor(fmt, levels = geo_codes)), function(x){
  if(length(x)) paste(x, collapse = "") else NA})), geo, fmt))

My first alternative is about 2 times faster than the second, benchmarked on 2e5 rows.

Upvotes: 2

Rafael Toledo
Rafael Toledo

Reputation: 1054

A tidyverse solution:

library(tidyverse)

create_new_code <- function(id, format, char) {
    format %>% 
        str_locate_all(paste0(char, "*", char)) %>% 
        unlist() %>% 
        {substr(id, .[1], .[2])}
}

create_new_codes <- function(id, format) {
    c("s", "p", "c", "z") %>% 
        set_names() %>% 
        map(create_new_code, id = id, format = format)
}

bind_cols(df, 
          with(df, map2_df(geoid, format, create_new_codes)))

#          geoid        format  s     p    c        z
#1       0224230       ssppppp 02 24230 <NA>     <NA>
#2 0202000000126 ssccczzzzzzzz 02  <NA>  020 00000126

Upvotes: 0

Kyle Ward
Kyle Ward

Reputation: 919

As is so often the case, writing up the question and the minimum example helped me simplify the problem and identify a solution. I'm sure there is a fancier solution out there, but this is what I came up with, and it's easy(ish) to get your head around.

While the formats vary, there are a limited number of unique characters. In the toy example in this problem, only s, c, p, z. So here's what I did:

First, I created a function that takes a single format string, a single geoid string, and a single subgeo character/code. The function determines which character positions in format match subgeo and then returns those positions from geoid.

extract_sub_geo <- function(format, geoid, subgeo) {
  geoid_v <- unlist(strsplit(geoid, ""))
  format_v <- unlist(strsplit(format, ""))
  positions <- which(format_v == subgeo)
  result <- paste(geoid_v[positions], collapse = "")
  return(result)
}

extract_sub_geo("ssccczzzzzzzz", "0202000000126", "s")
[1] "02"

I then looped over each unique code and used pmap() to apply the function to my entire data frame.

geo_codes <- c("s", "c", "p", "z")

for (code in geo_codes) {
  df <- df %>%
    mutate(
      !!code := pmap_chr(list(format, remainder, !!(code)), extract_sub_geo)
    )
}
# A tibble: 2 x 6
  geoid         format        s     c     p     z       
  <chr>         <chr>         <chr> <chr> <chr> <chr>   
1 0224230       ssppppp       02    ""    02000 ""      
2 0202000000126 ssccczzzzzzzz 02    020   ""    00000126

Probably cleaner to do the loop in base R instead of dplyr.

Upvotes: 0

Related Questions