Reputation: 477
I have a column containing strings (i.e. names of species), such as the one below:
Species
---------
Aaaaba fossicollis
Aaadonta constricta babelthuapi
Aaadonta constricta constricta
Aaadonta constricta komakanensis
Aaadonta constricta ssp. 2 DAB-2020
Aaadonta irregularis
Aaadonta sp. 1 DAB-2020
Aaadonta sp. DAB-2021
Aacanthocnema dobsoni
Aagaardia protensa
Aaptos aaptos
My goal is to remove specifically the third word (words being separated by spaces) from every string that is made up exclusively of letters. So the strings that include numbers would remain the same, as well as the strings that contain just two words.
My output would be this:
Species
---------
Aaaaba fossicollis
Aaadonta constricta
Aaadonta constricta
Aaadonta constricta
Aaadonta constricta ssp. 2 DAB-2020
Aaadonta irregularis
Aaadonta sp. 1 DAB-2020
Aaadonta sp. DAB-2021
Aacanthocnema dobsoni
Aagaardia protensa
Aaptos aaptos
I have tried this code, but it also removes part of the strings that contain numerical characters, as well as the second word in two-word strings:
df$Species<-gsub("\\s*\\w*$", "", df$Species)
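One way to see why this attempt misfires: the pattern is anchored only to the end of the string ($), so it strips whatever run of word characters comes last, whether that is the second, third or fifth word. A minimal sketch on three of the sample strings (the names s and res are only for illustration):

```r
# Three of the sample strings from the question
s <- c("Aaaaba fossicollis",
       "Aaadonta constricta babelthuapi",
       "Aaadonta sp. DAB-2021")

# The attempted pattern: optional whitespace, then word characters, then end of string
res <- gsub("\\s*\\w*$", "", s)
print(res)
```

The two-word name loses its second word, and in the last string only the trailing 2021 is removed, because - is not a word character.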
Upvotes: 2
Views: 94
Reputation: 39717
You can use sub.
sub("^([^0-9 ]+ [^0-9 ]+) [^0-9 ]+([^0-9]*)$", "\\1\\2", s)
# [1] "Aaaaba fossicollis" "Aaadonta constricta"
# [3] "Aaadonta constricta" "Aaadonta constricta"
# [5] "Aaadonta constricta ssp. 2 DAB-2020" "Aaadonta irregularis"
# [7] "Aaadonta sp. 1 DAB-2020" "Aaadonta sp. DAB-2021"
# [9] "Aacanthocnema dobsoni" "Aagaardia protensa"
#[11] "Aaptos aaptos" "add1 test string x"
#[13] "add test string x1" "add test string1 x"
#[15] "add test x"
^ .. Start of string
([^0-9 ]+ [^0-9 ]+) .. Two words; store them in \\1
[^0-9 ]+ .. Another word (the one that is dropped)
([^0-9]*)$ .. Rest of the string up to its end, which must not contain a digit; store it in \\2
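To make the two capture groups visible, the pattern can be run through regexec() and regmatches(). This is only a sketch: the helper name show_groups is made up here, and perl = TRUE is an assumption added so that the groups follow PCRE's greedy matching.

```r
pattern <- "^([^0-9 ]+ [^0-9 ]+) [^0-9 ]+([^0-9]*)$"

# Hypothetical helper: returns the full match followed by the two capture groups
# perl = TRUE is an assumption made for this sketch (PCRE greedy matching)
show_groups <- function(x) regmatches(x, regexec(pattern, x, perl = TRUE))[[1]]

g <- show_groups("add test string x")
g[2]                 # group 1: the first two words
g[3]                 # group 2: the text after the dropped third word
paste0(g[2], g[3])   # what the replacement "\\1\\2" reassembles

# A digit after the third word makes the whole pattern fail,
# so sub() would leave such a string unchanged
length(show_groups("add test string x1"))
```

For "add test string x" the groups are "add test" and " x", which is why the result is "add test x".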
Data
s <- c("Aaaaba fossicollis", "Aaadonta constricta babelthuapi",
"Aaadonta constricta constricta", "Aaadonta constricta komakanensis",
"Aaadonta constricta ssp. 2 DAB-2020", "Aaadonta irregularis",
"Aaadonta sp. 1 DAB-2020", "Aaadonta sp. DAB-2021",
"Aacanthocnema dobsoni", "Aagaardia protensa", "Aaptos aaptos",
"add1 test string x", "add test string x1", "add test string1 x",
"add test string x" )
Upvotes: 1
Reputation: 19191
With ifelse and grepl/sub, keeping the regex simple and only substituting in lines without digits. The sub looks for two non-digit words followed by a space and then anything, and replaces the match with the first two words.
with(df, ifelse(grepl("\\d+", Species),
Species, sub("(^\\D+ \\D+) .*", "\\1", Species)))
[1] "Aaaaba fossicollis" "Aaadonta constricta"
[3] "Aaadonta constricta" "Aaadonta constricta"
[5] "Aaadonta constricta ssp. 2 DAB-2020" "Aaadonta irregularis"
[7] "Aaadonta sp. 1 DAB-2020" "Aaadonta sp. DAB-2021"
[9] "Aacanthocnema dobsoni" "Aagaardia protensa"
[11] "Aaptos aaptos"
df <- structure(list(Species = c("Aaaaba fossicollis", "Aaadonta constricta babelthuapi",
"Aaadonta constricta constricta", "Aaadonta constricta komakanensis",
"Aaadonta constricta ssp. 2 DAB-2020", "Aaadonta irregularis",
"Aaadonta sp. 1 DAB-2020", "Aaadonta sp. DAB-2021", "Aacanthocnema dobsoni",
"Aagaardia protensa", "Aaptos aaptos")), class = "data.frame", row.names = c(NA,
-11L))
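To write the result back into the data frame, the same idea can also be expressed with logical indexing instead of ifelse. This is only a sketch: it swaps \\D for \\S (since \\D also matches spaces, \\S pins each capture to a single word), and the name no_digits is introduced here for illustration. Note that it drops everything after the second word of a digit-free string, which for the three-word names in the sample is the same as dropping the third word.

```r
# A shortened copy of the sample data
df <- data.frame(Species = c("Aaaaba fossicollis",
                             "Aaadonta constricta babelthuapi",
                             "Aaadonta constricta ssp. 2 DAB-2020",
                             "Aaadonta sp. DAB-2021"),
                 stringsAsFactors = FALSE)

# Rows whose name contains no digit at all
no_digits <- !grepl("\\d", df$Species)

# Keep only the first two words in those rows;
# two-word names do not match the pattern and stay as they are
df$Species[no_digits] <- sub("^(\\S+ \\S+) .*$", "\\1", df$Species[no_digits])
df$Species
```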
Upvotes: 0
Reputation: 522264
We can use sub() with a capture group:
df$Species <- sub("(\\S+ \\S+) [A-Za-z]+(?!\\S)", "\\1", df$Species, perl=TRUE)
Here is an explanation of the regex pattern:
(\S+ \S+)    match and capture the first two non-whitespace terms
' '          match a single space
[A-Za-z]+    then match a third word consisting of only letters
(?!\S)       assert that this third word is followed by a space or the end of the string
Then, we replace with \1, to keep just the first two terms.
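One thing to be aware of: the pattern has no ^ anchor, so it can also match starting at a later word. A hedged variant that adds ^ (an addition here, not part of the answer above) restricts the removal to the actual third word. The last test string is made up for illustration.

```r
x <- c("Aaadonta constricta babelthuapi",
       "Aaadonta constricta ssp. 2 DAB-2020",
       "Aaadonta sp. 1 indet sample")   # made-up string for illustration

# Pattern as given: may match from any word onwards
unanchored <- sub("(\\S+ \\S+) [A-Za-z]+(?!\\S)", "\\1", x, perl = TRUE)

# Assumed variant: ^ forces the match to start at the first word
anchored <- sub("^(\\S+ \\S+) [A-Za-z]+(?!\\S)", "\\1", x, perl = TRUE)

print(unanchored)
print(anchored)
```

Both versions agree on the sample data. They differ only on the made-up string, where the unanchored pattern starts at "sp." and removes "indet", the fourth word.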
Upvotes: 2