extract last word from string only if more than one word R

Question

I'm having a slight issue with extracting a word from a character string. I have a column that is the taxon name for a species "Genus species". I was trying to create a new column with just the species. Initially I just used

library(stringr)
count$species  <- word(count$taxon_name, 2)

to extract the second word. This worked great until I realized that a few entries in the taxon_name column have a parenthesized word between the genus and species, like so: "Genus (word) species".

To remove that I wrote this code, which worked great at removing the parenthesized word from the entries that had it:

count$new_taxon <- gsub("\\([^()]*\\)", "", count$taxon_name)

and then performed the above on the new column

count$species  <- word(count$new_taxon, 2)

That still works on all the ones that haven't been altered, but if an entry had a parenthesis removed it just leaves the entry blank, and doesn't extract anything. I think it might be recognizing the space as a word? I tried changing around whether the column was a factor or character column and it didn't make a difference. Any suggestions?

NOTE: Essentially there are three types of input in the taxon_name column (1) Genus species (2) Genus and (3) Genus (word) species.

When I try anything that extracts the last word, it handles cases (1) and (3), but it also returns a value for case (2), which I want to be NA because it doesn't have a species.

Rui Barradas · Accepted Answer

Maybe something like the following.

x <- c("Genus species", "Genus", "Genus (word) species")
y <- gsub(".*[[:blank:]](\\w+)$", "\\1", x)
is.na(y) <- y == "Genus"
y
[1] "species" NA        "species"

Note that it would be very difficult to match the species names directly, since we don't have a full list of them. When a string contains no blank, the pattern does not match and gsub returns the string unchanged, so I've opted to set the elements of the result y to NA where they equal "Genus".
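If the real genus names vary, comparing against the literal "Genus" will not catch every single-word entry. A more general sketch, under the assumption that any entry without a blank has no species, tests for a blank first and only then strips everything up to the last one. The names has_space and species are illustrative only; this uses base R and is a variation on the approach above, not a replacement for it.

```r
x <- c("Genus species", "Genus", "Genus (word) species")

# Drop leading/trailing whitespace so a stray trailing space
# is not mistaken for a word separator.
x <- trimws(x)

# TRUE where the entry contains at least one space or tab,
# i.e. where it has more than one word.
has_space <- grepl("[[:blank:]]", x)

# The greedy ".*" runs up to the last blank, so removing the match
# leaves only the final word. Single-word entries become NA.
species <- ifelse(has_space, sub(".*[[:blank:]]", "", x), NA_character_)
species
```

Because this keys on the last blank rather than on word position, it should also be unaffected by the doubled space that is left behind when "(word)" is deleted from the middle of a name.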
