Reputation: 3098
I have an ugly and complex set of strings that I have to split:
vec <- c("'01'", "'01' '02'",
"#bateau", "#bateau #batiment",
"#'autres 32'", "#'autres 32' #'batiment 30'", "#'autres 32' #'batiment 30' #'contenu 31'",
"#'34'", "#'34' #'33' #'35'")
vec
[1] "'01'" "'01' '02'"
[3] "#bateau" "#bateau #batiment"
[5] "#'autres 32'" "#'autres 32' #'batiment 30'"
[7] "#'autres 32' #'batiment 30' #'contenu 31'" "#'34'"
[9] "#'34' #'33' #'35'"
I need to split the string everywhere there is a space, except where the space is between two ' characters. So in the example above, '01' '02' would become '01' and '02', while #'autres 32' #'batiment 30' would become #'autres 32' and #'batiment 30'.
I've tried getting inspiration from this question, but didn't get far:
strsplit(vec, "(\\s[^']+?)('.*?'|$)")
This attempt splits at some spaces where it shouldn't and loses some information as well.
The result from the split should be something like:
res <- c("'01'", "'01'", "'02'",
"#bateau", "#bateau", "#batiment",
"#'autres 32'", "#'autres 32'", "#'batiment 30'", "#'autres 32'", "#'batiment 30'", "#'contenu 31'",
"#'34'", "#'34'", "#'33'", "#'35'")
What would be the proper regular expression to split this string?
Thanks
Upvotes: 3
Views: 251
Reputation: 626845
You may use
strsplit(vec, "'[^']*'(*SKIP)(*F)|\\s+", perl=TRUE)
See the R demo and the regex demo online.
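As a quick check, the call above can be run on the vector from the question and compared with the desired result. This is only a sketch: the names parts, res and res_expected are introduced here for illustration. Note that strsplit() returns a list, so unlist() is used to get a flat character vector.

```r
vec <- c("'01'", "'01' '02'",
         "#bateau", "#bateau #batiment",
         "#'autres 32'", "#'autres 32' #'batiment 30'",
         "#'autres 32' #'batiment 30' #'contenu 31'",
         "#'34'", "#'34' #'33' #'35'")

# Skip over quoted chunks, then split on the remaining whitespace runs
parts <- strsplit(vec, "'[^']*'(*SKIP)(*F)|\\s+", perl = TRUE)

# strsplit() returns one character vector per input string;
# unlist() flattens them into a single vector
res <- unlist(parts)

# The result the question asks for
res_expected <- c("'01'", "'01'", "'02'",
                  "#bateau", "#bateau", "#batiment",
                  "#'autres 32'", "#'autres 32'", "#'batiment 30'",
                  "#'autres 32'", "#'batiment 30'", "#'contenu 31'",
                  "#'34'", "#'34'", "#'33'", "#'35'")

stopifnot(identical(res, res_expected))
```

If the pieces should stay grouped by input string, keep parts as a list instead of flattening it.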
Details
'[^']*'(*SKIP)(*F) - a ', then any 0+ chars other than ' (see [^']*) and then another '; this matched text is then discarded and the next match is searched for from the position where the failure occurred (see (*SKIP)(*F))
| - or
\s+ - 1+ whitespace chars.
Since it is a PCRE pattern, perl=TRUE is obligatory.
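To see what the first alternative contributes, it may help to compare a plain whitespace split with the skip-and-fail version on a single string. The variable names x, naive and skipped below are made up for this illustration.

```r
x <- "#'autres 32' #'batiment 30'"

# Plain whitespace split: also cuts at the spaces inside the quotes
naive <- strsplit(x, "\\s+")[[1]]
naive
# [1] "#'autres"   "32'"        "#'batiment" "30'"

# Quoted chunks are matched first and thrown away by (*SKIP)(*F),
# so only the whitespace outside the quotes is left to split on
skipped <- strsplit(x, "'[^']*'(*SKIP)(*F)|\\s+", perl = TRUE)[[1]]
skipped
# [1] "#'autres 32'"   "#'batiment 30'"
```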
Upvotes: 5