Reputation: 6496
I have the following vector:
a <- c("SOCORRO SANTANDER", "SANTANDER DE QUILICHAO",
"LOS PATIOS NORTE DE SANTANDER", "LOS PATIOS NTE DE S DER")
and I need to remove every occurrence of "SANTANDER" or its abbreviation (along with a preceding NORTE or its abbreviation, if present), but only when it appears at the end of the string.
So far I've tried the following (the comments say why each attempt fails):
gsub("(.*)( N.*DER$)", "\\1", a) # Fails at SOCORRO
gsub("(.*)( N.*DER$| DER$)", "\\1", a) # Only removes DER at LOS PATIOS
gsub("(.*)([ N.*DER$]|[ DER$])", "\\1", a) # Removes trailing R (??)
gsub("(.*)( N?.*DER$)", "\\1", a) # Fails removing " NTE DE S" and "NORTE DE"
So, in particular, I'd like to know how to remove the unwanted part of the string. More generally, I'd like to know the right way to write regexes for this kind of situation (my first wording of the question was "how to use OR (|) inside a group"; I seriously expected attempts 2 or 3 to work).
Expected result is:
a
## [1] "SOCORRO" "SANTANDER DE QUILICHAO" "LOS PATIOS" "LOS PATIOS"
Upvotes: 1
Views: 1521
Reputation: 887028
We can try
sub("(.*)(\\s+N.*(DER)$)|\\s+SANTANDER$", "\\1", a)
#[1] "SOCORRO" "SANTANDER DE QUILICHAO" "LOS PATIOS"
#[4] "LOS PATIOS"
Or
sub("\\s+(N(\\S+\\s+){1,}|)\\S*DER$", "", a)
#[1] "SOCORRO" "SANTANDER DE QUILICHAO" "LOS PATIOS"
#[4] "LOS PATIOS"
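As for why attempts 2 and 3 in the question misbehave, here is a small sketch of my reading of it. The variable names (`attempt2`, `class_matches_one_char`, `lazy`) are made up for illustration, and the lazy variant is only there to show the effect of greediness; it is not meant as a replacement for the patterns above.

```r
a <- c("SOCORRO SANTANDER", "SANTANDER DE QUILICHAO",
       "LOS PATIOS NORTE DE SANTANDER", "LOS PATIOS NTE DE S DER")

# Attempt 2: the leading (.*) is greedy, so it keeps as much of the string
# as it can and leaves only the short alternative " DER" for the second group.
attempt2 <- sub("(.*)( N.*DER$| DER$)", "\\1", a)
attempt2[4]
## [1] "LOS PATIOS NTE DE S"

# Attempt 3: square brackets are a character class (one character out of a
# set), not a group. "[ DER$]" matches a single " ", "D", "E", "R" or "$".
class_matches_one_char <- grepl("^[ DER$]$", c("D", " DER"))
class_matches_one_char
## [1]  TRUE FALSE

# Illustration only: a lazy prefix (.*?) with perl = TRUE stops at the first
# point where the suffix can match, so the longer alternative is used.
lazy <- sub("^(.*?)( N.*DER| DER)$", "\\1", a, perl = TRUE)
lazy[4]
## [1] "LOS PATIOS"
```

So the OR inside the group works as expected; the surprise comes from the greedy `(.*)` in front of it, and from `[...]` not being a grouping construct.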
Upvotes: 1
Reputation: 35314
sub('(\\s*\\b(NORTE\\s+DE|NTE\\s+DE))?\\s*\\b(SANTANDER|S\\s+DER)$','',a);
## [1] "SOCORRO" "SANTANDER DE QUILICHAO" "LOS PATIOS" "LOS PATIOS"
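If it helps readability, the same idea can be assembled from named pieces. This is only a sketch: `norte`, `santander`, `pattern` and the helper `strip_santander()` are names I am making up here, and the pattern requires at least one whitespace character before each piece (`\\s+`), which is slightly stricter than the `\\s*\\b` used above.

```r
a <- c("SOCORRO SANTANDER", "SANTANDER DE QUILICHAO",
       "LOS PATIOS NORTE DE SANTANDER", "LOS PATIOS NTE DE S DER")

# Optional prefix: NORTE DE or its abbreviation NTE DE
norte <- "(NORTE|NTE)\\s+DE"
# Required suffix: SANTANDER or its abbreviation S DER
santander <- "(SANTANDER|S\\s+DER)"

# The whole "NORTE DE" group is optional (?), and $ anchors the match to the
# end of the string, so a leading SANTANDER is left alone.
pattern <- paste0("(\\s+", norte, ")?\\s+", santander, "$")

# Hypothetical helper: strip the trailing department name from each element
strip_santander <- function(x) sub(pattern, "", x)

strip_santander(a)
## [1] "SOCORRO" "SANTANDER DE QUILICHAO" "LOS PATIOS" "LOS PATIOS"
```

Listing the alternatives explicitly, rather than using something like `N.*DER`, avoids accidentally stripping unrelated endings that merely start with N and end with DER.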
sub() is used here rather than gsub(), since we don't need to match multiple times within the same string.
Upvotes: 2