wizkids121

Reputation: 656

What am I doing wrong in this gsub example?

I'm looking at this tutorial for using RegEx with stringr. It gives the example below:

str <- c("i.e., George W. Bush", "Lyndon B. Johnson, etc.")
gsub("([A-Z])[.]?", "\\1", str)

The tutorial tells me the output will generate:

[1] "George W Bush"    "Lyndon B Johnson"

But when I run an identical script in R, this is what happens:

str <- c("i.e., George W. Bush", "Lyndon B. Johnson, etc.")
gsub("([A-Z])[.]?", "\\1", str)
[1] "i.e., George W Bush"    "Lyndon B Johnson, etc."

It returns almost the original text: only the dots after the initials are dropped, and the "i.e.," and ", etc." are still there. Even when I run it on one of the regex tester sites, it spits back the same thing.

From https://regex101.com/

Am I doing something wrong (likely)? Or is the tutorial wrong (doubtful)? I feel like I'm taking crazy pills here (confirmed).

Upvotes: 3

Views: 131

Answers (1)

beigel

Reputation: 1200

It looks like what you're doing is right and there is in fact a mistake in the tutorial. I tested the regex too; you can see it here.

The regex you were given captures any uppercase letter that may or may not be followed by a dot, and replaces the match with just the letter. For instance, "W." in "George W. Bush" is replaced with "W", but "i.e." is left alone because none of its characters are uppercase. If we had "I.E." it would be replaced with "IE".

To get just the names we need a different regex. One approach is to capture the first name, middle initial, and last name, and let the leading and trailing .* swallow everything around them. You could get that effect with the regex .*([A-Z][a-z]+)\s([A-Z])[.]+\s([A-Z][a-z]+).* (see here), or in R using

str <- c("i.e., George W. Bush", "Lyndon B. Johnson, etc.")
gsub(".*([A-Z][a-z]+) ([A-Z])[.]+ ([A-Z][a-z]+).*", "\\1 \\2 \\3", str)
#> [1] "George W Bush"    "Lyndon B Johnson"

But that's probably not the most effective way to sanitize names.
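
One alternative, as a rough sketch only: pull out the "First M. Last" part with regmatches() and regexpr() from base R, then strip the dot in a second step. The variable names name_pattern, full_names and clean_names are made up for illustration, and the pattern assumes every name has exactly one single-letter middle initial followed by a dot.

```r
str <- c("i.e., George W. Bush", "Lyndon B. Johnson, etc.")

# Assumed name shape: capitalized first name, one initial with a dot,
# capitalized last name, separated by single spaces.
name_pattern <- "[A-Z][a-z]+ [A-Z][.] [A-Z][a-z]+"

# Step 1: extract the first match from each string.
# Note: regmatches() drops elements that have no match at all.
full_names <- regmatches(str, regexpr(name_pattern, str))

# Step 2: remove the literal dot after the middle initial.
clean_names <- gsub(".", "", full_names, fixed = TRUE)
clean_names
#> [1] "George W Bush"    "Lyndon B Johnson"
```

Splitting the job into "extract" and "clean" keeps each pattern short, but it still won't handle names without a middle initial, hyphenated surnames, and so on.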

Upvotes: 1
