Reputation: 417
I have a column with the following kind of strings:
Author
Achebe, Chinua. Ach
Akbar, M.j. Akb
Alanahally, Srikrishna. Ala
These are author names, each followed by a three-letter abbreviation. I only want to remove the abbreviation at the end of the string: if I simply removed every three-letter word, author names like Jon and Sam would be deleted too. The abbreviation usually follows two spaces. I wrote the following regex to detect and delete it:
data$Author <- gsub("\\s([A-Z]+[A-Za-z]{2})\\s", "", data$Author)
What should I change so that it deletes these three-letter abbreviations?
Upvotes: 2
Views: 388
Reputation: 430
Try this find-and-replace pattern:
Find: \s?\s\w+$
Replace: (leave it empty)
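For reference, the find-and-replace pattern above could be applied in R roughly as follows. This is only a sketch: the `authors` vector is sample data assumed to have two spaces before each abbreviation, and `result` is an illustrative name. Note that `\\w+` matches a trailing word of any length, so this would also strip a final word that is not a three-letter abbreviation.

```r
# Sample data (assumed): two spaces before each trailing abbreviation
authors <- c("Achebe, Chinua.  Ach", "Akbar, M.j.  Akb", "Alanahally, Srikrishna.  Ala")

# One or two whitespace characters, then a run of word characters, anchored at the end.
# sub() is enough because the $ anchor allows at most one match per string.
result <- sub("\\s?\\s\\w+$", "", authors)
result
# Expected elements: "Achebe, Chinua.", "Akbar, M.j.", "Alanahally, Srikrishna."
```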
Upvotes: 0
Reputation: 160687
Your \\s at the end of the pattern forces a space after the three letters, and none of the samples have one. Options:
You cannot simply remove it or replace it with \\s*, as those would be too permissive (and break things):
gsub("\\s([A-Z]+[A-Za-z]{2})", "", authors)
# [1] "Achebe,nua. " "Akbar, M.j. " "Alanahally,krishna. "
Add a word boundary, \\b:
gsub("\\s([A-Z]+[A-Za-z]{2})\\b", "", authors)
# [1] "Achebe, Chinua. " "Akbar, M.j. " "Alanahally, Srikrishna. "
Change the trailing \\s to the end-of-string anchor, $:
gsub("\\s([A-Z]+[A-Za-z]{2})$", "", authors)
# [1] "Achebe, Chinua. " "Akbar, M.j. " "Alanahally, Srikrishna. "
(though I think this might be over-constraining).
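The outputs above still end in a space, because only one of the two spaces before the abbreviation is consumed. If that is unwanted, one possible variant (a sketch, not one of the options above) matches all of the preceding whitespace with \\s+ and anchors at the end of the string. It assumes the abbreviation is always one capital letter followed by two letters; `cleaned` is an illustrative name.

```r
# Sample data (assumed): two spaces before each trailing abbreviation
authors <- c("Achebe, Chinua.  Ach", "Akbar, M.j.  Akb", "Alanahally, Srikrishna.  Ala")

# \\s+ consumes every whitespace character before the abbreviation;
# [A-Z][A-Za-z]{2} is the three-letter abbreviation; $ pins it to the end.
cleaned <- sub("\\s+[A-Z][A-Za-z]{2}$", "", authors)
cleaned
# Expected elements: "Achebe, Chinua.", "Akbar, M.j.", "Alanahally, Srikrishna."
```

A caveat: a string with no abbreviation that happens to end in a capitalized three-letter word (for example a name ending in "Sam") would lose that word, so it may be worth checking the data for such cases first.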
Data
authors <- c("Achebe, Chinua.  Ach", "Akbar, M.j.  Akb", "Alanahally, Srikrishna.  Ala")
Upvotes: 2