Reputation: 77
I have a string in the below format:
a <- c("first_name=James(Mr), cust_id=98503(ZZW_LG,WGE,zonaire),
StartDate=2015-05-20, EndDate=2015-05-20, performance=best")
My aim is to get the final result in a dataframe as below:
first_name cust_id start_date end_date performance cust_notes
James(Mr) 98503 2015-05-20 2015-05-20 best ZZW_LG,WGE,zonaire
I ran the following code:
a <- c("first_name=James(Mr), cust_id=98503(ZZW_LG,WGE,zonaire),
StartDate=2015-05-20, EndDate=2015-05-20, performance=best")
split_by_comma <- strsplit(a,",")
split_by_equal <- lapply(split_by_comma,strsplit,"=")
Since the custid had got additional commas and brackets, I am not getting desired result.
Please note that brackets in first name are genuine and needed as it is.
Upvotes: 1
Views: 1658
Reputation: 1
Late reply, but posted it since its very simple to understand and implement without using any additional packages
rawdf = read.csv("<your file path>", header = F, sep = ",", stringsAsFactors = F)
# Get the first row of the dataframe and transpose it into a column of a df
colnames = data.frame(t(rawdf[1,]))
# Split the values of the single column df created above into its key value
# pairs which are separated by '=' and save in a vector
colnames = unlist(strsplit(as.character(colnames$X1), "="))
# Pick up all the odd indexed values from the above vector (all odd places
# are colnames and even places the values associated with them)
colnames = colnames[seq(1,length(colnames),2)]
# Assign the extracted column names from the vector above to your original data frame
colnames(rawdf) = colnames
# Use the regex to extract the value in each field of the original df by
# replacing the 'Key=' pattern present in each field with an empty string
for(i in 1:dim(rawdf)[2]) rawdf[,i] = gsub(paste(colnames[i],"=",sep=""), "", rawdf[,i])
Upvotes: 0
Reputation: 78792
If your string format holds true, this might be a quick solution:
library(httr)
a <- c("first_name=James(Mr), cust_id=98503(ZZW_LG,WGE,zonaire), StartDate=2015-05-20,
EndDate=2015-05-20, performance=best")
dat <- data.frame(parse_url(sprintf("?%s", gsub(",[[:space:]]+", "&", a)))$query,
stringsAsFactors=FALSE)
library(tidyr)
library(dplyr)
mutate(separate(dat, cust_id, into=c("cust_id", "cust_notes"), sep="\\("),
cust_notes=gsub("\\)", "", cust_notes))
## first_name cust_id cust_notes StartDate EndDate performance
## 1 James(Mr) 98503 ZZW_LG,WGE,zonaire 2015-05-20 2015-05-20 best
Extrapolation:
gsub(",[[:space:]]+", "&", a)
makes the parameters look like a components of a URL query string.sprintf(…)
make it look like an actual query stringparse_url
(from httr
) will separate the key/value pairs out and stick them in a list (named query
) in the returned listdata.frame
will, well…separate
will split the cust_id
column for you at the (
into two columnsmutate
will remove the )
in the new cust_notes
columnHere's the whole thing as a "pipe":
library(httr)
library(tidyr)
library(dplyr)
library(magrittr)
a <- c("first_name=James(Mr), cust_id=98503(ZZW_LG,WGE,zonaire), StartDate=2015-05-20,
EndDate=2015-05-20, performance=best")
a %>%
gsub(",[[:space:]]+", "&", .) %>%
sprintf("?%s", .) %>%
parse_url() %>%
extract2("query") %>%
data.frame(stringsAsFactors=FALSE) %>%
separate(cust_id, into=c("cust_id", "cust_notes"), sep="\\(") %>%
mutate(cust_notes=gsub("\\)", "", cust_notes))
which matches the extrapolation and is (IMO) easier to follow.
Upvotes: 0
Reputation: 67968
You need to split by this.
,(?![^()]*\\))
You need lookahead
.This will not split by ,
within ()
.See demo.
https://regex101.com/r/uF4oY4/82
To get desired result use
split_by_comma <- strsplit(a,",(?![^()]*\\))",perl=TRUE)
split_by_equal <- lapply(split_by_comma,strsplit,"=")
Upvotes: 1