How do I detect and delete abbreviations with regex in R?

Question

I have a column with the following kind of strings:

Author 

Achebe, Chinua.  Ach
Akbar, M.j.  Akb
Alanahally, Srikrishna.  Ala

These are author names followed by a three-letter abbreviation of the name. I only want to match the abbreviation at the end of the string, because if I simply look for any three-letter word, names like Jon and Sam will be deleted too. The abbreviation usually comes after two spaces. I wrote the following regex to detect and delete it:

data$Author <- gsub("\\s([A-Z]+[A-Za-z]{2})\\s", "", data$Author)

What do I need to change so that it deletes these three-letter abbreviations?

r2evans · Accepted Answer

The \s at the end of your pattern requires a whitespace character after the three letters, and none of the samples have one. Options:

Simply removing it, or replacing it with \s*, does not work: both are too permissive and also strip the first three letters of other words:

gsub("\\s([A-Z]+[A-Za-z]{2})", "", authors)
# [1] "Achebe,nua. "         "Akbar, M.j. "         "Alanahally,krishna. "

Add a word boundary, \b:

gsub("\\s([A-Z]+[A-Za-z]{2})\\b", "", authors)
# [1] "Achebe, Chinua. "         "Akbar, M.j. "             "Alanahally, Srikrishna. "

Anchor the match to the end of the string with $:

gsub("\\s([A-Z]+[A-Za-z]{2})$", "", authors)
# [1] "Achebe, Chinua. "         "Akbar, M.j. "             "Alanahally, Srikrishna. "

(though I think this might be over-constraining).
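If the goal is a clean name with no trailing space, one possible variant of the end-of-string option is to let the pattern consume the preceding whitespace as well. The sketch below is an illustration under stated assumptions, not part of the original answer: it tightens `[A-Z]+` to a single capital so that exactly three letters match, uses `sub` because only one match per string is expected, and adds a hypothetical row (`tricky`) with a three-letter first name to show where the two options can differ.

```r
# Sample data from the answer.
authors <- c("Achebe, Chinua.  Ach", "Akbar, M.j.  Akb",
             "Alanahally, Srikrishna.  Ala")

# Hypothetical row (not from the question) with a three-letter first name.
tricky <- "Smith, Sam.  Smi"

# End-anchored variant: \\s+ also removes the spaces before the abbreviation.
end_pattern <- "\\s+[A-Z][A-Za-z]{2}$"

cleaned <- sub(end_pattern, "", authors)
# -> c("Achebe, Chinua.", "Akbar, M.j.", "Alanahally, Srikrishna.")

cleaned_tricky <- sub(end_pattern, "", tricky)
# -> "Smith, Sam."

# Word-boundary pattern for comparison: "Sam" is followed by ".",
# which is also a word boundary, so the first name is removed too.
boundary_tricky <- gsub("\\s([A-Z]+[A-Za-z]{2})\\b", "", tricky)
# -> "Smith,. "
```

On the three sample strings the \b and $ patterns give identical results. They only diverge on inputs like the hypothetical one, where a three-letter word in the middle of the string is followed by punctuation.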

Data

authors <- c("Achebe, Chinua.  Ach", "Akbar, M.j.  Akb", "Alanahally, Srikrishna.  Ala")
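As a quick check of the diagnosis, `grepl` can be run with the question's original pattern against this vector. This is only a sketch; the `padded` vector is a hypothetical variation added for illustration and is not data from the question.

```r
authors <- c("Achebe, Chinua.  Ach", "Akbar, M.j.  Akb",
             "Alanahally, Srikrishna.  Ala")

# The question's pattern: whitespace is required AFTER the three letters.
original_pattern <- "\\s([A-Z]+[A-Za-z]{2})\\s"

# Nothing follows the abbreviation in the samples, so nothing matches.
matched <- grepl(original_pattern, authors)
# -> FALSE FALSE FALSE

# Hypothetical: append a trailing space and the same pattern does match.
padded <- paste0(authors, " ")
matched_padded <- grepl(original_pattern, padded)
# -> TRUE TRUE TRUE
```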
