giordano

Reputation: 3152

R retrieving strings with sub: Why does this not work?

I would like to extract parts of strings. The string is:

> (x <- 'ab/cd efgh "xyz xyz"')
[1] "ab/cd efgh \"xyz xyz\""

Now, I would like first to extract the first part:

> # get "ab/cd efgh"
> sub(" \"[/A-Za-z ]+\"","",x)
[1] "ab/cd efgh"

But I cannot extract the second part:

> # get "xyz xyz"
> sub("(\"[A-Za-z ]+\")$","\\1",x, perl=TRUE)
[1] "ab/cd efgh \"xyz xyz\""

What is wrong with this code?
Thanks for your help.

Upvotes: 1

Views: 63

Answers (2)

lmo

Reputation: 38510

To get this to work with sub, the pattern has to match the whole string, because sub replaces only the matched portion and leaves the rest of the string in place. The help file says

For sub and gsub return a character vector of the same length and with the same attributes as x (after possible coercion to character). Elements of character vectors x which are not substituted will be returned unchanged (including any declared encoding).

So to get this to work with your regex, prepend the catch-all ".*" (sometimes risky, because it is greedy):

sub(".*(\"[A-Za-z ]+\")$","\\1",x, perl=TRUE)
[1] "\"xyz xyz\""
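As a minimal sketch of both points (the anchored pattern and the "returned unchanged" behaviour quoted from the help file), the following can be run as-is. The variable names quoted and no_match and the sample string "no quotes here" are illustrative assumptions, not part of the original answer.

```r
x <- 'ab/cd efgh "xyz xyz"'

# With ".*" in front, the pattern covers the whole string,
# so the replacement keeps only capture group 1.
quoted <- sub(".*(\"[A-Za-z ]+\")$", "\\1", x, perl = TRUE)
print(quoted)

# A string the pattern does not match comes back unchanged.
no_match <- sub(".*(\"[A-Za-z ]+\")$", "\\1", "no quotes here", perl = TRUE)
print(no_match)
```

Because ".*" is greedy, it consumes as much as it can before backtracking to let the quoted part match at the end of the string.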

Upvotes: 1

Wiktor Stribiżew

Reputation: 626927

Your last snippet does not work because you reinsert the whole match back into the result: (\"[A-Za-z ]+\")$ matches and captures ", 1+ letters and spaces, " into Group 1 and \1 in the replacement puts it back.
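A small sketch that makes this visible, assuming the same x as in the question: regmatches() with regexpr() shows what the pattern actually matches, and the sub() call then swaps that match for an identical copy of itself. The names pattern, matched and result are illustrative.

```r
x <- 'ab/cd efgh "xyz xyz"'
pattern <- "(\"[A-Za-z ]+\")$"

# The portion of x that the pattern matches (only the quoted tail).
matched <- regmatches(x, regexpr(pattern, x, perl = TRUE))
print(matched)

# Group 1 is the whole match, so replacing the match with \\1 changes nothing.
result <- sub(pattern, "\\1", x, perl = TRUE)
print(identical(result, x))
```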

You may actually get the last part inside quotes by removing all chars other than " at the start of the string:

x <- 'ab/cd efgh "xyz xyz"'
sub('^[^"]+', "", x)

Here sub finds and replaces only once: it matches the start of the string (^) followed by one or more characters other than " (the negated character class [^"]+).
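For reference, a runnable sketch of this approach. The second step, which strips the surrounding quotes with gsub(), is an assumption about what the asker may ultimately want and is not part of the original answer; the names with_quotes and without_quotes are illustrative.

```r
x <- 'ab/cd efgh "xyz xyz"'

# Drop everything before the first double quote.
with_quotes <- sub('^[^"]+', "", x)
print(with_quotes)

# Optional extra step (assumption): remove the double quotes themselves.
without_quotes <- gsub('"', "", with_quotes, fixed = TRUE)
print(without_quotes)
```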

Upvotes: 1
