Reputation: 10578

Extracting capital words and extracting the last word in a string

I have a df that looks like this:

df <- data.frame(
    x = c(
        "800 Block of MAIN ST",
        "100 Block of CHESTNUT AV", 
        "BAY ST / WELLINGTON ST", 
        "LARKIN ST / ELLIS ST",
        "MAPLE ST / WELLINGTON ST", 
        "MEANDERING RD / MAIN ST"),
    y = rnorm(6))

I want to extract the first street name (the first all-caps word) and the last street type (the last word in the string).

Desired Output:

                         x          y        x.1 x.2
1     800 Block of MAIN ST -0.6745405       MAIN  ST
2 100 Block of CHESTNUT AV -1.1316017   CHESTNUT  AV
3   BAY ST / WELLINGTON ST  1.2887577        BAY  ST
4     LARKIN ST / ELLIS ST  1.4606264     LARKIN  ST
5 MAPLE ST / WELLINGTON ST  0.6538595      MAPLE  ST
6  MEANDERING RD / MAIN ST  0.8472322 MEANDERING  ST

Upvotes: 3

Answers (3)

David Arenburg

Reputation: 92300

Here's a similar solution using a single regular expression combined with the new tstrsplit function from the development version of data.table:

library(data.table) # v1.9.5+
setDT(df)[, c("street", "type") := 
              tstrsplit(sub(".*?([A-Z]{3,}).*([A-Z]{2,})", "\\1,\\2", x), ",")]
df
#                           x          y     street type
# 1:     800 Block of MAIN ST -1.4391801       MAIN   ST
# 2: 100 Block of CHESTNUT AV  1.4917789   CHESTNUT   AV
# 3:   BAY ST / WELLINGTON ST -0.0369405        BAY   ST
# 4:     LARKIN ST / ELLIS ST  0.7320230     LARKIN   ST
# 5: MAPLE ST / WELLINGTON ST  0.7189120      MAPLE   ST
# 6:  MEANDERING RD / MAIN ST -0.9836794 MEANDERING   ST

Basically, the idea is to capture both groups within a single sub call and join them with a comma (any separator that doesn't occur in the data will do). A transposed string split (tstrsplit) then converts the result into two separate columns, which are created by reference using the := operator.
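To make the two steps easier to see, here is a minimal base R sketch of the same idea on two of the example strings. The names addr, joined, parts, street and type are only illustrative, strsplit plus vapply stands in for tstrsplit, and perl = TRUE is an addition here (an assumption, not part of the code above) so that the lazy and greedy quantifiers follow PCRE rules.

```r
# Two of the example strings; `addr` is a hypothetical name for illustration.
addr <- c("800 Block of MAIN ST", "BAY ST / WELLINGTON ST")

# Step 1: a single sub() call rewrites each string as "<street>,<type>".
#   .*?          lazily skips up to the first run of 3+ capitals (group 1)
#   .*           greedily skips as far as it can
#   ([A-Z]{2,})  picks up the capitals left at the end (group 2)
# perl = TRUE is an assumption added for this sketch.
joined <- sub(".*?([A-Z]{3,}).*([A-Z]{2,})", "\\1,\\2", addr, perl = TRUE)
print(joined)
# [1] "MAIN,ST" "BAY,ST"

# Step 2: split on the comma and pull out each piece as its own vector.
# This mimics what a transposed split does, without needing data.table.
parts  <- strsplit(joined, ",", fixed = TRUE)
street <- vapply(parts, `[`, character(1), 1)
type   <- vapply(parts, `[`, character(1), 2)

print(data.frame(x = addr, street = street, type = type))
```

One thing this makes visible: because the `.*` before the second group is greedy, that group keeps only the last two capitals of the string. That is fine for two-letter types such as ST and AV, but a longer type like AVE would come out as VE with this pattern.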

Upvotes: 3

Jota

Reputation: 17621

df <- within(df, st_name <- sub(".*?([A-Z]{3,}).*", "\\1", x, perl=TRUE))

df <- within(df, st_type <- sub(".+? ([A-Z]+)$", "\\1", x, perl=TRUE))
#                         x           y    st_name st_type
#1     800 Block of MAIN ST  1.92908789       MAIN      ST
#2 100 Block of CHESTNUT AV  0.02487045   CHESTNUT      AV
#3   BAY ST / WELLINGTON ST -2.33411242        BAY      ST
#4     LARKIN ST / ELLIS ST -1.17946144     LARKIN      ST
#5 MAPLE ST / WELLINGTON ST  0.12913797      MAPLE      ST
#6  MEANDERING RD / MAIN ST -0.94150930 MEANDERING      ST

Or if you aren't fond of using within:

df$st_name <- sub(".*?([A-Z]{3,}).*", "\\1", df$x, perl=TRUE)
df$st_type <- sub(".+? ([A-Z]+)$", "\\1", df$x, perl=TRUE)

Upvotes: 3

Pierre L

Reputation: 28461

library(stringr)
df[,c("street", "type")] <- list(str_extract(df$x, "[A-Z]{3,}"), str_extract(df$x, "[A-Z]+$"))
#                          x          y     street type
# 1     800 Block of MAIN ST  0.7787495       MAIN   ST
# 2 100 Block of CHESTNUT AV -0.7069777   CHESTNUT   AV
# 3   BAY ST / WELLINGTON ST -0.2365061        BAY   ST
# 4     LARKIN ST / ELLIS ST  0.1399500     LARKIN   ST
# 5 MAPLE ST / WELLINGTON ST -0.3423978      MAPLE   ST
# 6  MEANDERING RD / MAIN ST  0.6434969 MEANDERING   ST

Upvotes: 4
