Reputation: 3152
I would like to extract parts of strings. The string is:
> (x <- 'ab/cd efgh "xyz xyz"')
[1] "ab/cd efgh \"xyz xyz\""
First, I would like to extract the first part:
> # get "ab/cd efgh"
> sub(" \"[/A-Za-z ]+\"","",x)
[1] "ab/cd efgh"
But I cannot extract the second part:
> # get "xyz xyz"
> sub("(\"[A-Za-z ]+\")$","\\1",x, perl=TRUE)
[1] "ab/cd efgh \"xyz xyz\""
What is wrong with this code?
Thanks for help.
Upvotes: 1
Views: 63
Reputation: 38510
To get this to work with sub, you have to match the whole string. The help file says:
For sub and gsub return a character vector of the same length and with the same attributes as x (after possible coercion to character). Elements of character vectors x which are not substituted will be returned unchanged (including any declared encoding).
So to get this to work with your regex, prepend the sometimes risky catch-all ".*", so that the pattern also consumes everything before the quoted part:
sub(".*(\"[A-Za-z ]+\")$","\\1",x, perl=TRUE)
[1] "\"xyz xyz\""
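A minimal sketch of why the leading .* matters, using the same x as in the question. The angle brackets in the replacement are only markers added for illustration; they show that sub replaces just the text the pattern matched, so anything the pattern does not cover stays in the result.

```r
x <- 'ab/cd efgh "xyz xyz"'

# Pattern covers only the quoted tail: the prefix "ab/cd efgh " is left untouched.
tail_only <- sub("(\"[A-Za-z ]+\")$", "<\\1>", x, perl = TRUE)

# Pattern covers the whole string: the prefix is consumed and therefore dropped.
whole <- sub(".*(\"[A-Za-z ]+\")$", "<\\1>", x, perl = TRUE)

print(tail_only)
print(whole)
```

The first call keeps the prefix and only wraps the quoted part; the second returns the wrapped quoted part alone.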
Upvotes: 1
Reputation: 626927
Your last snippet does not work because you reinsert the whole match back into the result: (\"[A-Za-z ]+\")$ matches and captures ", 1+ letters and spaces, " into Group 1 and \1 in the replacement puts it back.
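To make this concrete, here is a small sketch with the question's x. The regmatches() and regexpr() calls are base R helpers that are not part of the original snippet; they are used here only to display what the pattern actually matched.

```r
x <- 'ab/cd efgh "xyz xyz"'
pattern <- "(\"[A-Za-z ]+\")$"

# What the pattern matches: only the quoted tail of the string.
matched <- regmatches(x, regexpr(pattern, x, perl = TRUE))

# Replacing that match with Group 1 (the very same text) leaves x unchanged.
unchanged <- sub(pattern, "\\1", x, perl = TRUE)

print(matched)
print(identical(unchanged, x))
```

Since Group 1 spans the entire match, the substitution swaps the match for itself and the input comes back as it went in.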
You may actually get the last part inside quotes by removing all chars other than " at the start of the string:
x <- 'ab/cd efgh "xyz xyz"'
sub('^[^"]+', "", x)
See the R demo
The sub here will find and replace just once, and it will match the string start (with ^) followed by 1+ chars other than " (with the [^"]+ negated character class).
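As a quick check, the sketch below runs that call and stores the result. The second step, which strips the surrounding quotes with gsub, is an optional addition not present in the answer above, in case the bare text is what is wanted.

```r
x <- 'ab/cd efgh "xyz xyz"'

# Drop everything before the first double quote.
quoted <- sub('^[^"]+', "", x)

# Optional (an assumption about the desired result): remove the quotes as well.
bare <- gsub('"', "", quoted, fixed = TRUE)

print(quoted)
print(bare)
```

Note that if the string contains no leading non-quote characters, the pattern does not match and sub returns the input unchanged.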
Upvotes: 1