R gsub remove word variation ONLY at end of string

Question

I have the following vector:

a <- c("SOCORRO SANTANDER", "SANTANDER DE QUILICHAO", 
       "LOS PATIOS NORTE DE SANTANDER", "LOS PATIOS NTE DE S DER")

and need to remove all occurrences of "SANTANDER" or its abbreviation (and a preceding NORTE or its abbreviation, if present), but only when they appear at the end of the string.

So far I've tried the following (the comments note why each attempt fails):

gsub("(.*)( N.*DER$)", "\\1", a)       # Fails at SOCORRO
gsub("(.*)( N.*DER$| DER$)", "\\1", a) # Only removes DER at LOS PATIOS
gsub("(.*)([ N.*DER$]|[ DER$])", "\\1", a) # Removes trailing R (??)
gsub("(.*)( N?.*DER$)", "\\1", a)  # Fails removing " NTE DE S" and "NORTE DE"

In particular, I'd like to know how to remove the unwanted part of the string. More generally, I'd like to know the right way to write regexes for this kind of situation (my first draft of this sentence was "how to use OR (|) inside a group"; I seriously expected attempts 2 or 3 to work).

Expected result is:

a
## [1] "SOCORRO"  "SANTANDER DE QUILICHAO"  "LOS PATIOS"  "LOS PATIOS"

akrun · Accepted Answer

We can try

sub("(.*)(\\s+N.*(DER)$)|\\s+SANTANDER$", "\\1", a)
#[1] "SOCORRO"                "SANTANDER DE QUILICHAO" "LOS PATIOS"            
#[4] "LOS PATIOS"

Or

sub("\\s+(N(\\S+\\s+){1,}|)\\S*DER$", "", a)
#[1] "SOCORRO"                "SANTANDER DE QUILICHAO" "LOS PATIOS"            
#[4] "LOS PATIOS"
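
On the more general part of the question: square brackets define a character class, which matches a single character from the set, so something like `[ DER$]` matches just one of space, D, E, R or a literal `$`. That is why the third attempt only strips the trailing R. Alternation has to go inside parentheses instead. In the second attempt the alternation itself is fine, but the greedy `(.*)` in front takes as much of the string as it can, so the short ` DER$` branch is the one that ends up matching. Below is a minimal sketch that drops the leading `(.*)` and spells the variants out explicitly. It assumes the only abbreviations are "NTE" and "S DER" (the ones in the sample vector); `pattern`, `one_char` and `cleaned` are just illustrative names.

```r
a <- c("SOCORRO SANTANDER", "SANTANDER DE QUILICHAO",
       "LOS PATIOS NORTE DE SANTANDER", "LOS PATIOS NTE DE S DER")

# A character class matches ONE character out of the listed set,
# so this only removes a single trailing character (the final "R").
one_char <- sub("[ DER$]$", "", a)

# Alternation goes inside parentheses; "?" makes the whole NORTE/NTE
# prefix optional. The abbreviations listed here are an assumption
# based on the sample data, not an exhaustive list.
pattern <- "\\s+((NORTE|NTE)\\s+DE\\s+)?(SANTANDER|S\\s+DER)$"
cleaned <- sub(pattern, "", a)
cleaned
#[1] "SOCORRO"                "SANTANDER DE QUILICHAO" "LOS PATIOS"
#[4] "LOS PATIOS"
```

Listing the variants is stricter than the `N.*DER$` style patterns above: it will not touch a string that merely contains " N" somewhere and happens to end in "DER", at the cost of having to add any further abbreviations by hand.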
