tadeufontes

Reputation: 477

How to remove the third word from a string, but only if the string contains exclusively letters?

I have a column containing strings (i.e. names of species), such as the one below:

 Species
---------
Aaaaba fossicollis
Aaadonta constricta babelthuapi
Aaadonta constricta constricta
Aaadonta constricta komakanensis
Aaadonta constricta ssp. 2 DAB-2020
Aaadonta irregularis
Aaadonta sp. 1 DAB-2020
Aaadonta sp. DAB-2021
Aacanthocnema dobsoni
Aagaardia protensa
Aaptos aaptos

My goal is to remove specifically the third word (words being separated by spaces) from every string that is made up exclusively of letters. So the strings that include numbers would remain the same, as well as the strings that contain just two words.

My output would be this:

 Species
---------
Aaaaba fossicollis
Aaadonta constricta
Aaadonta constricta
Aaadonta constricta
Aaadonta constricta ssp. 2 DAB-2020
Aaadonta irregularis
Aaadonta sp. 1 DAB-2020
Aaadonta sp. DAB-2021
Aacanthocnema dobsoni
Aagaardia protensa
Aaptos aaptos

I have tried the code below, but it also removes part of the strings that contain numeric characters, as well as the second word of two-word strings:

df$Species<-gsub("\\s*\\w*$", "", df$Species)

Upvotes: 2

Views: 94

Answers (3)

GKi

Reputation: 39717

You can use sub.

sub("^([^0-9 ]+ [^0-9 ]+) [^0-9 ]+([^0-9]*)$", "\\1\\2", s)
# [1] "Aaaaba fossicollis"                  "Aaadonta constricta"                
# [3] "Aaadonta constricta"                 "Aaadonta constricta"                
# [5] "Aaadonta constricta ssp. 2 DAB-2020" "Aaadonta irregularis"               
# [7] "Aaadonta sp. 1 DAB-2020"             "Aaadonta sp. DAB-2021"              
# [9] "Aacanthocnema dobsoni"               "Aagaardia protensa"                 
#[11] "Aaptos aaptos"                       "add1 test string x"                 
#[13] "add test string x1"                  "add test string1 x"                 
#[15] "add test x"                         

^ .. start of string
([^0-9 ]+ [^0-9 ]+) .. two words (no digits, no spaces), stored in \\1
 [^0-9 ]+ .. a space and a third word, which is dropped
([^0-9]*)$ .. the rest of the string up to its end, containing no digits, stored in \\2
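To see what the two capture groups hold, the pattern can be inspected with base R's regexec() and regmatches(). This is only a sketch; the variable names pat, x and groups are chosen here for illustration and are not part of the answer.

```r
# Same pattern as above, stored in a variable for reuse.
pat <- "^([^0-9 ]+ [^0-9 ]+) [^0-9 ]+([^0-9]*)$"
x <- "add test string x"

# regmatches() on a regexec() result returns the full match
# followed by one element per capture group.
groups <- regmatches(x, regexec(pat, x))[[1]]
groups[2]  # "add test"  (the two leading words)
groups[3]  # " x"        (whatever follows the third word)

sub(pat, "\\1\\2", x)                     # "add test x"
sub(pat, "\\1\\2", "add test string x1")  # digit present: no match, unchanged
```

Because the pattern is anchored at both ends and no part of it may match a digit, a single digit anywhere in the string prevents the match, and sub() returns the input unchanged.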

Data

s <- c("Aaaaba fossicollis", "Aaadonta constricta babelthuapi",
       "Aaadonta constricta constricta", "Aaadonta constricta komakanensis", 
       "Aaadonta constricta ssp. 2 DAB-2020", "Aaadonta irregularis", 
       "Aaadonta sp. 1 DAB-2020", "Aaadonta sp. DAB-2021",
       "Aacanthocnema dobsoni", "Aagaardia protensa", "Aaptos aaptos",
       "add1 test string x", "add test string x1", "add test string1 x",
       "add test string x" )

Upvotes: 1

Andre Wildberg

Reputation: 19191

Using ifelse() with grepl() and sub() keeps the regex simple: only strings without digits are modified. The sub() call matches two space-separated runs of non-digit characters, followed by a space and anything after it, and replaces the whole match with the first two words.

with(df, ifelse(grepl("\\d+", Species), 
           Species, sub("(^\\D+ \\D+) .*", "\\1", Species)))
#  [1] "Aaaaba fossicollis"                  "Aaadonta constricta"                
#  [3] "Aaadonta constricta"                 "Aaadonta constricta"                
#  [5] "Aaadonta constricta ssp. 2 DAB-2020" "Aaadonta irregularis"               
#  [7] "Aaadonta sp. 1 DAB-2020"             "Aaadonta sp. DAB-2021"              
#  [9] "Aacanthocnema dobsoni"               "Aagaardia protensa"                 
# [11] "Aaptos aaptos"

Data

df <- structure(list(Species = c("Aaaaba fossicollis", "Aaadonta constricta babelthuapi", 
"Aaadonta constricta constricta", "Aaadonta constricta komakanensis", 
"Aaadonta constricta ssp. 2 DAB-2020", "Aaadonta irregularis", 
"Aaadonta sp. 1 DAB-2020", "Aaadonta sp. DAB-2021", "Aacanthocnema dobsoni", 
"Aagaardia protensa", "Aaptos aaptos")), class = "data.frame", row.names = c(NA, 
-11L))
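A possible variant of the same ifelse() idea, sketched under the assumption that "exclusively letters" means only A-Z, a-z and spaces: test for that directly instead of testing for the absence of digits. The helper name drop_third_word is hypothetical and not part of the answer above.

```r
# Hypothetical helper: drop the third space-separated word, but only
# from strings that consist of ASCII letters and spaces.
drop_third_word <- function(x) {
  letters_only <- grepl("^[A-Za-z ]+$", x)
  ifelse(letters_only,
         sub("^([A-Za-z]+ [A-Za-z]+) [A-Za-z]+", "\\1", x),
         x)
}

res <- drop_third_word(c("Aaadonta constricta babelthuapi",
                         "Aaadonta sp. DAB-2021",
                         "Aaptos aaptos",
                         "add test string x"))
res
# [1] "Aaadonta constricta"   "Aaadonta sp. DAB-2021" "Aaptos aaptos"
# [4] "add test x"
```

With this check, strings containing punctuation such as "sp." are left alone even when they contain no digits, and in a four-word string only the third word is removed. Whether that matches the intended behaviour depends on the data.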

Upvotes: 0

Tim Biegeleisen

Reputation: 522264

We can use sub() with a capture group:

df$Species <- sub("(\\S+ \\S+) [A-Za-z]+(?!\\S)", "\\1", df$Species, perl=TRUE)

Here is an explanation of the regex pattern:

  • (\S+ \S+) match and capture the first two non whitespace terms
  • a literal space: match a single space
  • [A-Za-z]+ then match a third word consisting of only letters
  • (?!\S) assert that this third word is followed by a space or the end of the string

Then, we replace the match with \1, which keeps the first two terms and drops the third word; anything after the match is left untouched.
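A quick check of this pattern on a few strings, as a sketch. The names species, pat and strict are introduced here for illustration, and the anchored variant is only one possible tightening, assuming that the first two words should also be letters only.

```r
species <- c("Aaadonta constricta babelthuapi",
             "Aaadonta constricta ssp. 2 DAB-2020",
             "Aaadonta sp. DAB-2021",
             "add1 test string x")

# Pattern from the answer: not anchored, first two terms are any non-whitespace.
pat <- "(\\S+ \\S+) [A-Za-z]+(?!\\S)"
out <- sub(pat, "\\1", species, perl = TRUE)
out[1]  # "Aaadonta constricta"
out[4]  # "add1 test x"  (a digit in the first word does not block the match)

# Assumed tightening: anchor at the start and require letters in all three words.
strict <- "^([A-Za-z]+ [A-Za-z]+) [A-Za-z]+(?!\\S)"
out_strict <- sub(strict, "\\1", species, perl = TRUE)
out_strict[4]  # "add1 test string x"  (left unchanged)
```

Both patterns leave the three sample strings containing "ssp.", "sp." or digits in the third position unchanged, because the letters-only third word must be followed by whitespace or the end of the string.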


Upvotes: 2
