PavoDive

Reputation: 6496

R gsub remove word variation ONLY at end of string

I have the following vector:

a <- c("SOCORRO SANTANDER", "SANTANDER DE QUILICHAO", 
       "LOS PATIOS NORTE DE SANTANDER", "LOS PATIOS NTE DE S DER")

and need to remove all occurrences of "SANTANDER" or its abbreviation (together with a preceding "NORTE" or its abbreviation, if present), but only when they appear at the end of the string.

So far I've tried the following (the comments note why each attempt fails):

gsub("(.*)( N.*DER$)", "\\1", a)       # Fails at SOCORRO
gsub("(.*)( N.*DER$| DER$)", "\\1", a) # Only removes DER at LOS PATIOS
gsub("(.*)([ N.*DER$]|[ DER$])", "\\1", a) # Removes trailing R (??)
gsub("(.*)( N?.*DER$)", "\\1", a)  # Fails removing " NTE DE S" and "NORTE DE"

In particular, I'd like to know how to remove the unwanted part of the string correctly. More generally, I'd like to know the right way to build a regex for this kind of situation. My first idea was to use OR (|) inside a group, and I seriously expected attempts 2 or 3 to work.

Expected result is:

a
## [1] "SOCORRO"  "SANTANDER DE QUILICHAO"  "LOS PATIOS"  "LOS PATIOS"

Upvotes: 1

Views: 1521

Answers (2)

akrun

Reputation: 887028

We can try

sub("(.*)(\\s+N.*(DER)$)|\\s+SANTANDER$", "\\1", a)
#[1] "SOCORRO"                "SANTANDER DE QUILICHAO" "LOS PATIOS"            
#[4] "LOS PATIOS"     

Or

sub("\\s+(N(\\S+\\s+){1,}|)\\S*DER$", "", a)
#[1] "SOCORRO"                "SANTANDER DE QUILICHAO" "LOS PATIOS"            
#[4] "LOS PATIOS"  

Upvotes: 1

bgoldst

Reputation: 35314

sub('(\\s*\\b(NORTE\\s+DE|NTE\\s+DE))?\\s*\\b(SANTANDER|S\\s+DER)$','',a);
## [1] "SOCORRO"  "SANTANDER DE QUILICHAO"  "LOS PATIOS"  "LOS PATIOS"
  • We don't need gsub(), since we don't need to match multiple times within the same string.
  • A bracket expression will match only a single character, hence it's not appropriate for this regex.
  • The dollar character is only special when outside of a bracket expression.
  • You seem to have tried matching both the abbreviation and the full-length word with the same regex piece. I would advise against this; they are conceptually different pieces. If a word and its abbreviation happen to share a suffix, that is coincidental, and you shouldn't build a regex around that fact. Hence I think alternations are most appropriate here.
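
The points above can be checked with a small sketch. The variable names and the pattern `pat` are illustrative assumptions, not part of the original answer: `pat` is a simplified variant of the alternation idea that drops `\\b` and requires whitespace before each piece.

```r
a <- c("SOCORRO SANTANDER", "SANTANDER DE QUILICHAO",
       "LOS PATIOS NORTE DE SANTANDER", "LOS PATIOS NTE DE S DER")

# A bracket expression matches exactly ONE character from its set
# (here: space, D, E, R or a literal $), so only the final "R" is removed.
one_char <- sub("[ DER$]$", "", "SOCORRO SANTANDER")
one_char
## [1] "SOCORRO SANTANDE"

# Inside brackets, "$" is a literal dollar sign, not the end-of-string anchor.
dollar_literal <- grepl("[$]", c("A$", "A"))
dollar_literal
## [1]  TRUE FALSE

# A parenthesised group with "|" matches whole alternatives instead.
grouped <- sub("( DER| SANTANDER)$", "", "SOCORRO SANTANDER")
grouped
## [1] "SOCORRO"

# Hypothetical simplified pattern: optional "NORTE DE" / "NTE DE",
# then "SANTANDER" or "S DER", anchored at the end of the string.
pat <- "(\\s+(NORTE|NTE)\\s+DE)?\\s+(SANTANDER|S\\s+DER)$"
with_sub  <- sub(pat, "", a)
with_gsub <- gsub(pat, "", a)

# Because the pattern is anchored with "$", it can match at most once
# per string, so sub() and gsub() give the same result.
identical(with_sub, with_gsub)
## [1] TRUE
with_sub
## [1] "SOCORRO"                "SANTANDER DE QUILICHAO" "LOS PATIOS"
## [4] "LOS PATIOS"
```

Spelling out each alternative keeps the pattern from accidentally matching other words that merely end in "DER".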

Upvotes: 2
