What am I doing wrong in this gsub example?

Question

I'm looking at this tutorial for using RegEx with stringr. Using the below example:

str <- c("i.e., George W. Bush", "Lyndon B. Johnson, etc.")
gsub("([A-Z])[.]?", "\\1", str)

The tutorial tells me the output will generate:

[1] "George W Bush"    "Lyndon B Johnson"

But then I run an identical script on R and this is what happens:

str <- c("i.e., George W. Bush", "Lyndon B. Johnson, etc.")
gsub("([A-Z])[.]?", "\\1", str)
[1] "i.e., George W Bush"    "Lyndon B Johnson, etc."

It returns almost the original text: only the periods after the initials are removed, while "i.e.," and ", etc." are still there. Even when I run it on one of the regex tester sites it still spits back the same thing.

Am I doing something wrong (likely)? Or is the tutorial wrong (doubtful)? I feel like I'm taking crazy pills here (confirmed).

beigel · Accepted Answer

It looks like what you're doing is right and there is in fact a mistake in the tutorial; I tested the regex too. The regex you were given captures any uppercase letter that may or may not be followed by a dot, and replaces the match with just the letter. For instance, "W." in "George W. Bush" is substituted with "W", but "i.e." is not matched because none of its characters are capitalized. If we had "I.E." it would be substituted with "IE".

To get the names the tutorial shows, we need a different regex. One approach is to capture the first name, middle initial, and last name, and discard everything else. You could get that effect with the regex .*([A-Z][a-z]+)\s([A-Z])[.]+\s([A-Z][a-z]+).* or, in R, using

str <- c("i.e., George W. Bush", "Lyndon B. Johnson, etc.")
gsub(".*([A-Z][a-z]+) ([A-Z])[.]+ ([A-Z][a-z]+).*", "\\1 \\2 \\3", str)
#> [1] "George W Bush"    "Lyndon B Johnson"

But that's probably not the most effective way to sanitize names.
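
As a rough check of the explanation above, the sketch below runs the tutorial's pattern on the original string and on a hypothetical uppercase variant ("I.E."), then tries a two-step alternative: pull out the "First M. Last" part first, then drop the dot. The two-step approach and the variable names (tutorial_pattern, name_pattern, full_names, clean_names) are assumptions for illustration, not something the tutorial proposes, and the name pattern only covers names shaped exactly like these two.

```r
# Tutorial pattern: an uppercase letter optionally followed by a literal dot.
tutorial_pattern <- "([A-Z])[.]?"

# Lowercase "i.e." contains no uppercase letter, so it is left alone;
# only the dot after "W" is dropped.
lower_result <- gsub(tutorial_pattern, "\\1", "i.e., George W. Bush")
lower_result
#> [1] "i.e., George W Bush"

# Hypothetical uppercase variant: each "letter + dot" pair loses its dot.
upper_result <- gsub(tutorial_pattern, "\\1", "I.E., George W. Bush")
upper_result
#> [1] "IE, George W Bush"

# Two-step alternative (an assumption, not the tutorial's method):
# 1. extract the first "First M. Last" match from each string,
# 2. strip the dot from the middle initial.
str <- c("i.e., George W. Bush", "Lyndon B. Johnson, etc.")
name_pattern <- "[A-Z][a-z]+ [A-Z][.] [A-Z][a-z]+"

# Note: regmatches() silently drops elements that have no match.
full_names <- regmatches(str, regexpr(name_pattern, str))
clean_names <- gsub("[.]", "", full_names)
clean_names
#> [1] "George W Bush"    "Lyndon B Johnson"
```

Splitting extraction from cleanup keeps each pattern short, at the cost of silently skipping any input that does not contain a name in this exact shape.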
