Reputation: 919
I'm working with Census (CTPP) data, and the GEOID field is a long string that contains lots of geographic information. The format of this string changes for various Census tables, but they provide a code lookup. Here are a sample GEOID and format 'code'. (The parts I can already parse have been removed. This is the part of the GEOID I can't parse.)
geoid <- "0202000000126"
format <- "ssccczzzzzzzz"
This means that the first two characters ("02"
) signify the state (Alaska), the next three ("020"
) are the county, and the remaining characters are the zone.
I have a table of these geoid/format pairs, and the format can be different for each row.
df <- data.frame(
geoid = c(
"0224230",
"0202000000126"
),
format = c(
"ssppppp",
"ssccczzzzzzzz"
)
)
# A tibble: 2 x 2
geoid format
<chr> <chr>
1 0224230 ssppppp
2 0202000000126 ssccczzzzzzzz
What I'd like to do is break up the geoid
column into columns for each geography like so:
# A tibble: 2 x 6
geoid format s p c z
<chr> <chr> <chr> <chr> <chr> <chr>
1 0224230 ssppppp 02 24230 NA NA
2 0202000000126 ssccczzzzzzzz 02 NA 020 00000126
I've looked at several approaches. extract()
from stringr
looked promising. I'm also pretty sure I'll need a custom function that I mapply(?)/map over my data frame.
Upvotes: 2
Views: 195
Reputation: 67778
A base
alternative:
geo_codes <- c("s", "c", "p", "z")
# get starting position and lengths of consecutive characters in 'format'
g <- gregexpr("(.)\\1+", df$format)
# use the result above to extract corresponding substrings from 'geoid'
geo <- regmatches(df$geoid, g)
# select first element in each run of 'format' and split
# used to name substrings from above
fmt <- strsplit(gsub("(.)\\1+", "\\1", df$format), "")
# for each element in 'geo' and 'fmt',
# 1. create a named vector
# 2. index the vector with 'geo_codes'
# 3. set names of the full length vector
t(mapply(function(geo, fmt){
setNames(setNames(geo, fmt)[geo_codes], geo_codes)},
geo, fmt))
# s c p z
# [1,] "02" NA "24230" NA
# [2,] "02" "020" NA "00000126"
Another alternative,
geo <- strsplit(df$geoid, "")
fmt <- strsplit(df$format, "")
t(mapply(function(geo, fmt) unlist(lapply(split(geo, factor(fmt, levels = geo_codes)), function(x){
if(length(x)) paste(x, collapse = "") else NA})), geo, fmt))
My first alternative is about 2 times faster than the second, benchmarked on 2e5 rows.
Upvotes: 2
Reputation: 1054
A tidyverse solution:
library(tidyverse)
create_new_code <- function(id, format, char) {
format %>%
str_locate_all(paste0(char, "*", char)) %>%
unlist() %>%
{substr(id, .[1], .[2])}
}
create_new_codes <- function(id, format) {
c("s", "p", "c", "z") %>%
set_names() %>%
map(create_new_code, id = id, format = format)
}
bind_cols(df,
with(df, map2_df(geoid, format, create_new_codes)))
# geoid format s p c z
#1 0224230 ssppppp 02 24230 <NA> <NA>
#2 0202000000126 ssccczzzzzzzz 02 <NA> 020 00000126
Upvotes: 0
Reputation: 919
As is so often the case, writing up the question and the minimum example helped me simplify the problem and identify a solution. I'm sure there is a fancier solution out there, but this is what I came up with, and it's easy(ish) to get your head around.
While the formats vary, there are a limited number of unique characters. In the toy example in this problem, only s, c, p, z
. So here's what I did:
First, I created a function that takes a single format string, a single geoid string, and a single subgeo character/code. The function determines which character positions in format
match subgeo
and then returns those positions from geoid
.
extract_sub_geo <- function(format, geoid, subgeo) {
geoid_v <- unlist(strsplit(geoid, ""))
format_v <- unlist(strsplit(format, ""))
positions <- which(format_v == subgeo)
result <- paste(geoid_v[positions], collapse = "")
return(result)
}
extract_sub_geo("ssccczzzzzzzz", "0202000000126", "s")
[1] "02"
I then looped over each unique code and used pmap()
to apply the function to my entire data frame.
geo_codes <- c("s", "c", "p", "z")
for (code in geo_codes) {
df <- df %>%
mutate(
!!code := pmap_chr(list(format, remainder, !!(code)), extract_sub_geo)
)
}
# A tibble: 2 x 6
geoid format s c p z
<chr> <chr> <chr> <chr> <chr> <chr>
1 0224230 ssppppp 02 "" 02000 ""
2 0202000000126 ssccczzzzzzzz 02 020 "" 00000126
Probably cleaner to do the loop in base R instead of dplyr.
Upvotes: 0