Bastien

Reputation: 3098

Split R string at spaces but not when the space is between single quotes

I have an ugly and complex set of strings that I need to split:

vec <- c("'01'", "'01' '02'", 
         "#bateau", "#bateau #batiment",
         "#'autres 32'", "#'autres 32' #'batiment 30'", "#'autres 32' #'batiment 30' #'contenu 31'",
         "#'34'", "#'34' #'33' #'35'")
vec
[1] "'01'"                                      "'01' '02'"                                
[3] "#bateau"                                   "#bateau #batiment"                        
[5] "#'autres 32'"                              "#'autres 32' #'batiment 30'"              
[7] "#'autres 32' #'batiment 30' #'contenu 31'" "#'34'"                                    
[9] "#'34' #'33' #'35'" 

I need to split each string at every space, except when the space sits between single quotes ('). So in the example above, '01' '02' would become '01' and '02', while #'autres 32' #'batiment 30' would become #'autres 32' and #'batiment 30'.

I've tried getting inspiration from this question, but didn't get far:

strsplit(vec, "(\\s[^']+?)('.*?'|$)")

This splits at some spaces where it shouldn't and also makes me lose some information.

The result from the split should be something like:

res <- c("'01'", "'01'", "'02'", 
         "#bateau", "#bateau", "#batiment",
         "#'autres 32'", "#'autres 32'", "#'batiment 30'", "#'autres 32'", "#'batiment 30'", "#'contenu 31'",
         "#'34'", "#'34'", "#'33'", "#'35'")

What would be the proper regular expression to split this string?

Thanks

Upvotes: 3

Views: 251

Answers (1)

Wiktor Stribiżew

Reputation: 626845

You may use

strsplit(vec, "'[^']*'(*SKIP)(*F)|\\s+", perl=TRUE)

See the R demo and the regex demo online.
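As a quick check, here is a minimal sketch that runs the pattern on the question's vec and compares the flattened output with the expected res. The name out is just an illustrative variable, and flattening with unlist() is an assumption about how the final result is wanted:

```r
# Input strings and expected result, copied from the question
vec <- c("'01'", "'01' '02'",
         "#bateau", "#bateau #batiment",
         "#'autres 32'", "#'autres 32' #'batiment 30'",
         "#'autres 32' #'batiment 30' #'contenu 31'",
         "#'34'", "#'34' #'33' #'35'")
res <- c("'01'", "'01'", "'02'",
         "#bateau", "#bateau", "#batiment",
         "#'autres 32'", "#'autres 32'", "#'batiment 30'",
         "#'autres 32'", "#'batiment 30'", "#'contenu 31'",
         "#'34'", "#'34'", "#'33'", "#'35'")

# strsplit() returns one character vector per input string;
# unlist() flattens that list into a single vector
out <- unlist(strsplit(vec, "'[^']*'(*SKIP)(*F)|\\s+", perl = TRUE))

identical(out, res)
```

If the pieces should stay grouped by input string, drop the unlist() call and work with the list directly.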

Details

  • '[^']*'(*SKIP)(*F) - a ', then zero or more characters other than ' (the [^']* part), then a closing '. The (*SKIP)(*F) verbs then discard this matched text and force a failure, so the search for the next match resumes at the position where the quoted text ended
  • | - or
  • \s+ - 1+ whitespace chars.

Since the pattern relies on PCRE verbs, perl=TRUE is required.
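To see what the (*SKIP)(*F) branch contributes, this small sketch compares a plain whitespace split with the pattern above on one of the question's strings (x, naive and quoted are illustrative names only):

```r
x <- "#'autres 32' #'batiment 30'"

# Plain whitespace split: also breaks inside the quoted parts
naive <- strsplit(x, "\\s+", perl = TRUE)[[1]]
naive
# "#'autres" "32'" "#'batiment" "30'"

# Quoted substrings are matched first and skipped, so only the
# whitespace outside the quotes acts as a separator
quoted <- strsplit(x, "'[^']*'(*SKIP)(*F)|\\s+", perl = TRUE)[[1]]
quoted
# "#'autres 32'" "#'batiment 30'"
```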

Upvotes: 5
