Antonio

Reputation: 158

Split string into multiple rows by capital letters with cSplit

I have survey data. Some questions allowed for multiple answers. In my data, the different answers are separated by a comma. I want to add a new row in the dataframe for each choice. So I have something like this:

survey <- data.frame(q1 = c("I like this", "I like that", "I like this, but not much",
                            "I like that, but not much", "I like this,I like that",
                            "I like this, but not much,I like that"),
                     stringsAsFactors = FALSE)

If commas were only there to divide the multiple choices I'd use:

survey <- cSplit(survey, "q1", ",", direction = "long")

and get the desired result. Given some commas are part of the answer, I tried using comma followed by capital letter as a divider:

survey <- cSplit(survey, "q1", ",(?=[A-Z])", direction = "long")

But this does not work. It raises no error, but it does not split the strings and it also drops some rows from the dataframe. I then tried using strsplit:

strsplit(survey$q1, ",(?=[A-Z])", perl = TRUE)

which splits the strings correctly, but I can't work out how to make each sentence a separate row of the same column, the way cSplit does. The required output is:

survey$q1
[1] "I like this"
[2] "I like that"
[3] "I like this, but not much"
[4] "I like that, but not much"
[5] "I like this"
[6] "I like that"
[7] "I like this, but not much"
[8] "I like that"

Is there a way I can get it using one of the 2 methods? Thank you

Upvotes: 1

Views: 676

Answers (2)

Antonio

Reputation: 158

The answer by @akrun is the right one. I just wanted to add that, if some strings need to be split into more than 2 parts, his cSplit code works if you simply run the same line multiple times. I'm not entirely sure why this is the case, but it works.
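
A likely explanation, offered as a sketch rather than a confirmed diagnosis: the cSplit workaround uses sub(), which replaces only the first match in each string, so every extra run converts one more separator. Swapping in gsub() should convert all of them in a single pass. The three-part string x below is a made-up example, not part of the original data.

```r
# Hypothetical answer with three choices (not in the original survey data)
x <- "I like this,I like that,I like this, but not much"

# sub() replaces only the first comma that is followed by a capital letter
first_only <- sub(",(?=[A-Z])", ":", x, perl = TRUE)
first_only
# "I like this:I like that,I like this, but not much"

# gsub() replaces every such comma in one pass
all_matches <- gsub(",(?=[A-Z])", ":", x, perl = TRUE)
all_matches
# "I like this:I like that:I like this, but not much"
```

With gsub() in place of sub() inside transform(), a single cSplit call on ":" should then produce one row per choice.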

Upvotes: 1

akrun

Reputation: 887651

An option with separate_rows

library(dplyr)
library(tidyr)
survey %>% 
   separate_rows(q1, sep=",(?=[A-Z])")
#                       q1
#1               I like this
#2               I like that
#3 I like this, but not much
#4 I like that, but not much
#5               I like this
#6               I like that
#7 I like this, but not much
#8               I like that

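With cSplit, there is an argument fixed which is TRUE by default, so the separator is matched literally and the regex is never applied. Setting fixed = FALSE does not help here: as the error below shows, cSplit calls strsplit without perl = TRUE, and the lookahead (?=...) is only valid in PCRE, not in R's default regex engine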

library(splitstackshape)
cSplit(survey, "q1", ",(?=[A-Z])", direction = "long", fixed = FALSE)

Error in strsplit(indt[[splitCols[x]]], split = sep[x], fixed = fixed) : invalid regular expression ',(?=[A-Z])', reason 'Invalid regexp'
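
The same behaviour can be reproduced with strsplit alone, which is a quick way to check that the regex engine, not cSplit itself, is the issue. This is a minimal sketch; the variable names are illustrative only.

```r
x <- "I like this, but not much,I like that"
pattern <- ",(?=[A-Z])"

# Default regex engine: the lookahead is rejected, so capture the error message
default_result <- tryCatch(
  strsplit(x, pattern),
  error = function(e) conditionMessage(e)
)

# PCRE engine: the lookahead is accepted and only the ",I" comma is split on
perl_result <- strsplit(x, pattern, perl = TRUE)[[1]]
perl_result
# "I like this, but not much" "I like that"
```

Here default_result holds an error message instead of a split, while perl_result holds the two expected pieces.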

One way around the error is to first rewrite the separator with a function that accepts PCRE (sub/gsub with perl = TRUE), replacing it with a character that does not occur in the answers, and then use cSplit on that new sep

cSplit(transform(survey, q1 = sub(",(?=[A-Z])", ":", q1, perl = TRUE)), 
         "q1", sep=":", direction = "long")
#                        q1
#1:               I like this
#2:               I like that
#3: I like this, but not much
#4: I like that, but not much
#5:               I like this
#6:               I like that
#7: I like this, but not much
#8:               I like that
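
The strsplit call from the question can also be turned into rows in base R without either package. The following is a rough sketch under the assumption that survey is a plain data.frame; the names parts and long are introduced here for illustration.

```r
survey <- data.frame(
  q1 = c("I like this", "I like that", "I like this, but not much",
         "I like that, but not much", "I like this,I like that",
         "I like this, but not much,I like that"),
  stringsAsFactors = FALSE
)

# Split on a comma only when a capital letter follows (lookahead needs perl = TRUE)
parts <- strsplit(survey$q1, ",(?=[A-Z])", perl = TRUE)

# Repeat each original row once per piece, then overwrite q1 with the pieces
long <- survey[rep(seq_len(nrow(survey)), lengths(parts)), , drop = FALSE]
long$q1 <- unlist(parts)
rownames(long) <- NULL
long
```

Because whole rows are repeated before q1 is overwritten, any other columns in survey would be carried along with each piece.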

data

survey <- structure(list(q1 = c("I like this", "I like that", "I like this, but not much", 
"I like that, but not much", "I like this,I like that", "I like this, but not much,I like that"
)), class = "data.frame", row.names = c(NA, -6L))

Upvotes: 2
