Aman

Reputation: 417

How do I detect and delete abbreviations with regex in R?

I have a column with the following kind of strings:

Author 

Achebe, Chinua.  Ach
Akbar, M.j.  Akb
Alanahally, Srikrishna.  Ala

These are author names followed by a three-letter abbreviation of the name. The abbreviation appears only at the end of the string, usually after two spaces. I cannot simply remove every three-letter word, because that would also delete author names like Jon and Sam. I wrote the following regex to detect and delete the abbreviations:

data$Author <- gsub("\\s([A-Z]+[A-Za-z]{2})\\s", "", data$Author)

What do I need to change so that it deletes these three-letter abbreviations?

Upvotes: 2

Views: 388

Answers (2)

Haji Rahmatullah

Reputation: 430

Try this find-and-replace pattern:

Find: \s?\s\w+$

Replace: leave it empty
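
For readers working in R, the find/replace pair above might translate to a `sub()` call along these lines. This is only a sketch: the `authors` vector is copied from the sample strings in the question, and `cleaned` is a name chosen here for illustration.

```r
# Sample data copied from the question
authors <- c("Achebe, Chinua.  Ach", "Akbar, M.j.  Akb", "Alanahally, Srikrishna.  Ala")

# One or two whitespace characters, then a run of word characters,
# anchored at the end of the string; the match is replaced with nothing
cleaned <- sub("\\s?\\s\\w+$", "", authors)
cleaned
# [1] "Achebe, Chinua."         "Akbar, M.j."             "Alanahally, Srikrishna."
```

Note that this pattern removes any trailing word, not only three-letter abbreviations, so a value such as "Smith, Jon" with no abbreviation would be cut down to "Smith,".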

Upvotes: 0

r2evans

Reputation: 160687

The \\s at the end of your pattern requires a whitespace character after the three letters, and none of the samples here have one. Options:

  1. You cannot simply remove it or replace it with \\s*; either change makes the pattern too permissive, so it also matches inside the names:

    gsub("\\s([A-Z]+[A-Za-z]{2})", "", authors)
    # [1] "Achebe,nua. "         "Akbar, M.j. "         "Alanahally,krishna. "
    
  2. Add a word boundary, \\b:

    gsub("\\s([A-Z]+[A-Za-z]{2})\\b", "", authors)
    # [1] "Achebe, Chinua. "         "Akbar, M.j. "             "Alanahally, Srikrishna. "
    
  3. Change to end-of-string, $:

    gsub("\\s([A-Z]+[A-Za-z]{2})$", "", authors)
    # [1] "Achebe, Chinua. "         "Akbar, M.j. "             "Alanahally, Srikrishna. "
    

    (though I think this might be over-constraining).
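
Both working options above leave one trailing space in the result, and `[A-Z]+[A-Za-z]{2}` accepts three or more letters rather than exactly three. A possible variant, assuming the abbreviation always follows at least two whitespace characters as the question describes, is sketched below; the tightened pattern is a suggestion rather than part of the original answer, and `cleaned` is a name chosen here for illustration.

```r
# Sample data copied from the question
authors <- c("Achebe, Chinua.  Ach", "Akbar, M.j.  Akb", "Alanahally, Srikrishna.  Ala")

# \\s{2,}           : two or more whitespace characters (assumed separator)
# [A-Z][A-Za-z]{2}  : exactly three letters, the first one uppercase
# $                 : only at the end of the string
cleaned <- sub("\\s{2,}[A-Z][A-Za-z]{2}$", "", authors)
cleaned
# [1] "Achebe, Chinua."         "Akbar, M.j."             "Alanahally, Srikrishna."

# A three-letter first name after a single space is left alone
sub("\\s{2,}[A-Z][A-Za-z]{2}$", "", "Smith, Jon")
# [1] "Smith, Jon"
```

Requiring two spaces is what protects short names such as Jon and Sam; if some rows use a single space before the abbreviation, this pattern would miss them.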


Data

authors <- c("Achebe, Chinua.  Ach", "Akbar, M.j.  Akb", "Alanahally, Srikrishna.  Ala")

Upvotes: 2
