Reputation: 613
I have a CSV file as
AdvertiserName,Market
Wells Fargo,Gary INMetro Chicago IL Metro
EMC,Los Angeles CAMetro Boston MA Metro
Apple,Cupertino CA Metro
and the expression in regex is
res <-
gsub('(.*) ([A-Z]{2})*Metro (.*) ([A-Z]{2}) .*','\\1,\\2:\\3,\\4',
xx$Market)
And now the 'Market' column is like 'Gary IN MetroChicago IL Metro' instead of 'Gary INMetro Chicago IL Metro' and the CSV file is like
AdvertiserName,CampaignName
Wells Fargo,Gary IN MetroChicago IL Metro
EMC,Los Angeles CA MetroBoston MA Metro
Apple,Cupertino CA Metro
How to modify the expression in regex so as to get the desired output as
AdvertiserName,City,State
Wells Fargo,Gary,IN
Wells Fargo,Chicago,IL
EMC,Los Angeles,CA
EMC,Boston,MA
Apple,Cupertino,CA
New to R. Any help is appreciated.
Upvotes: 1
Views: 92
Reputation: 81753
Here's a way with strsplit
:
# read file
dat <- read.csv("filename.csv", stringsAsFactors = FALSE)
# split strings
splitted <- strsplit(dat$CampaignName,
"( (?=[A-Z]{2}))|((?<=[A-Z]{2}) [A-Z][a-z]+)", perl = TRUE)
# [[1]]
# [1] "Gary" "IN" "Chicago" "IL"
#
# [[2]]
# [1] "Los Angeles" "CA" "Boston" "MA"
#
# [[3]]
# [1] "Cupertino" "CA"
# create one data frame
setNames(as.data.frame(do.call(rbind,
mapply(cbind,
dat$AdvertiserName,
lapply(splitted, function(x)
matrix(x, ncol = 2, byrow = TRUE))))),
c("AdvertiserName", "City", "State"))
# AdvertiserName City State
# 1 Wells Fargo Gary IN
# 2 Wells Fargo Chicago IL
# 3 EMC Los Angeles CA
# 4 EMC Boston MA
# 5 Apple Cupertino CA
Upvotes: 2
Reputation: 16099
This is a bit dirty. Edits welcome.
# Read in the csv file (saved here as a .txt) to
y <- readLines("Stackoverflow20140226.txt")
# Every time see a state, shove a comma in
for (i in seq(y)){
y[[i]] <- gsub("([A-Z]{2}) ", "\\1, ", y[[i]])
}
tf <- tempfile()
writeLines(y, tf)
# Trick the csv file into thinking there are more columns
ncol <- max(count.fields(tf, sep = ","))
x <- read.csv(tf, fill = TRUE, header = FALSE, skip=1,
col.names = paste("V", seq_len(ncol), sep = ""))
unlink(tf)
# Use reshape to melt the data frame
library(reshape2)
xx <- melt(x, id.vars=1, measure.vars = 2:ncol(x))
xx$variable <- NULL
names(xx) <- c("AdvertiserName", "CampaignName")
xx
AdvertiserName CampaignName
1 Wells Fargo Gary IN
2 EMC Los Angeles CA
3 Apple Cupertino CA
4 Wells Fargo MetroChicago IL
5 EMC MetroBoston MA
6 Apple Metro
7 Wells Fargo Metro
8 EMC Metro
9 Apple
Upvotes: 0