user3188390
user3188390

Reputation: 613

How to modify regex expression in R?

I have a CSV file as

AdvertiserName,Market
Wells Fargo,Gary INMetro Chicago IL Metro
EMC,Los Angeles CAMetro Boston MA Metro
Apple,Cupertino CA Metro

and the expression in regex is

res <- 
 gsub('(.*) ([A-Z]{2})*Metro (.*) ([A-Z]{2}) .*','\\1,\\2:\\3,\\4',
  xx$Market)

And now the 'Market' column is like 'Gary IN MetroChicago IL Metro' instead of 'Gary INMetro Chicago IL Metro' and the CSV file is like

AdvertiserName,CampaignName
Wells Fargo,Gary IN MetroChicago IL Metro
EMC,Los Angeles CA MetroBoston MA Metro
Apple,Cupertino CA Metro

How to modify the expression in regex so as to get the desired output as

AdvertiserName,City,State
Wells Fargo,Gary,IN
Wells Fargo,Chicago,IL
EMC,Los Angeles,CA
EMC,Boston,MA
Apple,Cupertino,CA

New to R. Any help is appreciated.

Upvotes: 1

Views: 92

Answers (2)

Sven Hohenstein
Sven Hohenstein

Reputation: 81753

Here's a way with strsplit:

# read file
dat <- read.csv("filename.csv", stringsAsFactors = FALSE)

# split strings
splitted <- strsplit(dat$CampaignName, 
                     "( (?=[A-Z]{2}))|((?<=[A-Z]{2}) [A-Z][a-z]+)", perl = TRUE)
# [[1]]
# [1] "Gary"    "IN"      "Chicago" "IL"     
#
# [[2]]
# [1] "Los Angeles" "CA"          "Boston"      "MA"         
#
# [[3]]
# [1] "Cupertino" "CA"       

# create one data frame
setNames(as.data.frame(do.call(rbind, 
                               mapply(cbind, 
                                      dat$AdvertiserName, 
                                      lapply(splitted, function(x)
                                        matrix(x, ncol = 2, byrow = TRUE))))), 
         c("AdvertiserName", "City", "State"))
#   AdvertiserName        City State
# 1    Wells Fargo        Gary    IN
# 2    Wells Fargo     Chicago    IL
# 3            EMC Los Angeles    CA
# 4            EMC      Boston    MA
# 5          Apple   Cupertino    CA

Upvotes: 2

Hugh
Hugh

Reputation: 16099

This is a bit dirty. Edits welcome.

 # Read in the csv file (saved here as a .txt) to 
    y <- readLines("Stackoverflow20140226.txt")


# Every time see a state, shove a comma in
for (i in seq(y)){
y[[i]] <- gsub("([A-Z]{2}) ", "\\1, ", y[[i]])
}

tf <- tempfile()
writeLines(y, tf)

# Trick the csv file into thinking there are more columns
ncol <- max(count.fields(tf, sep = ","))
x <- read.csv(tf, fill = TRUE, header = FALSE, skip=1,
         col.names = paste("V", seq_len(ncol), sep = ""))
unlink(tf)
# Use reshape to melt the data frame
library(reshape2)
xx <- melt(x, id.vars=1, measure.vars = 2:ncol(x))

xx$variable <- NULL
names(xx) <- c("AdvertiserName", "CampaignName")

xx
  AdvertiserName     CampaignName
1    Wells Fargo          Gary IN
2            EMC   Los Angeles CA
3          Apple     Cupertino CA
4    Wells Fargo  MetroChicago IL
5            EMC   MetroBoston MA
6          Apple            Metro
7    Wells Fargo            Metro
8            EMC            Metro
9          Apple        

Upvotes: 0

Related Questions