R_and_Python_noob
R_and_Python_noob

Reputation: 129

Subset strings in R

One of the strings in my vector (df$location1) is the following:

Potomac, MD 20854\n(39.038266, -77.203413)

Rest of the data in the vector follow same pattern. I want to separate each component of the string into a separate data element and put it in new columns like: df$city, df$state, etc.

So far I have been able to isolate the lat. long. data into a separate column by doing the following:

df$lat.long <- gsub('.*\\\n\\\((.*)\\\)','\\\1',df$location1)

I was able to make it work by looking at other codes online but I don't fully understand it. I understand the regex pattern but don't understand the "\\1" part. Since I don't understand it in full I have been unable to use it to subset other parts of this same string.

I have looked into splitting the string after a comma, subset using regex, using scan() function and to many other variations. Now I am all confused. Thx

Upvotes: 1

Views: 835

Answers (5)

Brindle Cruncher
Brindle Cruncher

Reputation: 120

So this process might be a little longer, but for me it makes things clear. As opposed to using breaks, below I identify values by using a specific regex for each value I want. I make a vector of regex to extract each value, a vector for the variable names, then use a loop to extract and create the dataframe from those vectors.

library(stringi)
library(dplyr)
library(purrr)

rgexVec <- c("[\\w\\s-]+(?=,)", 
         "[A-Z]{2}", 
         "\\d+(?=\\n)", 
         "[\\d-\\.]+(?=,)", 
         "[\\d-\\.]+(?=\\))")
varNames <- c("city", 
          "state", 
          "zip", 
          "lat", 
          "long")
map2_dfc(varNames, rgexVec, function(vn, rg) {
  extractedVal <- stri_extract_first_regex(value, rg) %>% as.list()
  names(extractedVal) <- vn
  extractedVal %>% as_tibble()
})

Upvotes: 0

akrun
akrun

Reputation: 887213

Here is an option using base R

read.table(text= trimws(gsub(",+", " ", gsub("[, \n()]", ",", dat$Data))), 
  header = FALSE, col.names = c("City", "State", "Zip", "Latitude", "Longitude"), 
      stringsAsFactors = FALSE)
#    City State   Zip Latitude Longitude
#1 Potomac    MD 20854 39.03827 -77.20341

Upvotes: 1

mt1022
mt1022

Reputation: 17299

You can try strsplit or data.table::tstrsplit(strsplit + transpose):

> x <- 'Potomac, MD 20854\n(39.038266, -77.203413)'
> data.table::tstrsplit(x, ', |\\n\\(|\\)')
[[1]]
[1] "Potomac"

[[2]]
[1] "MD 20854"

[[3]]
[1] "39.038266"

[[4]]
[1] "-77.203413"

More generally, you can do this:

library(data.table)
df[c('city', 'state', 'lat', 'long')] <- tstrsplit(df$location1, ', |\\n\\(|\\)')

The pattern ', |\\n\\(|\\)' tells tstrsplit to split by ", ", "\n(" or ")".

In case you want to sperate state and zip and cite names may contain spaces, You can try a two-step way:

# original split (keep city names with space intact)
df[c('city', 'state', 'lat', 'long')] <- tstrsplit(df$location1, ', |\\n\\(|\\)')
# split state and zip
df[c('state', 'zip')] <- tstrsplit(df$state, ' ')

Upvotes: 1

tgrrr
tgrrr

Reputation: 754

\\1 is a back reference in regex. It is similar to a wildcard (*) that will grab all instances of your search term, not just the first one it finds.

Upvotes: -1

www
www

Reputation: 39154

We can also use the separate function from the tidyr package (part of the tidyverse package).

library(tidyverse)

# Create example data frame
dat <- data.frame(Data = "Potomac, MD 20854\n(39.038266, -77.203413)",
                  stringsAsFactors = FALSE)
dat
#                                         Data
# 1 Potomac, MD 20854\n(39.038266, -77.203413)

# Separate the Data column
dat2 <- dat %>%
  separate(Data, into = c("City", "State", "Zip", "Latitude", "Longitude"),
           sep = ", |\\\n\\(|\\)|[[:space:]]")
dat2
#      City State   Zip  Latitude  Longitude
# 1 Potomac    MD 20854 39.038266 -77.203413

Upvotes: 2

Related Questions