Ally D

Reputation: 165

extract last word from string only if more than one word R

I'm having a slight issue with extracting a word from a character string. I have a column that is the taxon name for a species "Genus species". I was trying to create a new column with just the species. Initially I just used

library(stringr)
count$species  <- word(count$taxon_name, 2)

to extract the second word. This worked great until I realized there are a few entries in the taxon_name column that have a parenthesis word between the genus and species, like so, "Genus (word) species".

To remove that I wrote this code, which worked great for removing the parenthesized word from the entries that had it:

count$new_taxon <- gsub("\\([^()]*\\)", "", count$taxon_name)

and then performed the above on the new column

count$species  <- word(count$new_taxon, 2)

That still works on all the ones that haven't been altered, but if an entry had a parenthesis removed it just leaves the entry blank, and doesn't extract anything. I think it might be recognizing the space as a word? I tried changing around whether the column was a factor or character column and it didn't make a difference. Any suggestions?
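The blank result can probably be traced to the two consecutive spaces left behind once the parenthesized word is removed: word() splits on a single space by default, so the empty piece between the two spaces counts as the second word. The sketch below reproduces this in base R with strsplit() and then collapses the repeated whitespace; the variable names are only illustrative, and stringr::str_squish() would be an alternative to the gsub()/trimws() combination.

```r
x <- "Genus (word) species"

# Removing the parenthesized word leaves two spaces between the remaining words
cleaned <- gsub("\\([^()]*\\)", "", x)

# Splitting on a single space yields an empty piece in second position
pieces <- strsplit(cleaned, " ")[[1]]

# Collapse runs of whitespace to one space and trim the ends before splitting
squished <- gsub("\\s+", " ", trimws(cleaned))
second_word <- strsplit(squished, " ")[[1]][2]
```

After the whitespace is collapsed, the second piece is the species again, so word(squished, 2) should behave as it did for the unaltered entries.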

NOTE: Essentially there are three types of input in the taxon_name column (1) Genus species (2) Genus and (3) Genus (word) species.

When I try anything that extracts the last word, it deals with case (1) and (3) but now it includes (2) which I just want to be NA, because it doesn't have a species.

Upvotes: 1

Views: 1061

Answers (2)

Rui Barradas

Reputation: 76402

Maybe something like the following.

x <- c("Genus species", "Genus", "Genus (word) species")
y <- gsub(".*[[:blank:]](\\w+)$", "\\1", x)
is.na(y) <- y == "Genus"
y
[1] "species" NA        "species"

Note that it would be very difficult to search for the species names directly, since we don't have a full list of them. That's why I've opted to set the elements of the result y to NA if they are equal to "Genus".
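With real data the genus differs from row to row, so comparing against the literal "Genus" would not carry over. One possible generalization, sketched here under the assumption that single-word entries contain no blanks, relies on gsub() returning its input unchanged when the pattern does not match. The taxon names below are hypothetical and used only for illustration.

```r
# Hypothetical taxon names covering the three cases from the question
x <- c("Homo sapiens", "Quercus", "Felis (Felis) catus")

# Same pattern as above: keep only the last word after a blank
y <- gsub(".*[[:blank:]](\\w+)$", "\\1", x)

# Entries without a blank come back unchanged, so mark them as NA
is.na(y) <- y == x
y
```

The same comparison should work on a data frame column, for example by applying it to count$taxon_name and storing the result in count$species.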

Upvotes: 1

Mako212

Reputation: 7292

Assuming "species" is never multiple words, you can do it like this:

count$species <- gsub("^.*\\s(\\w+)$", "\\1", count$taxon_name)

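The pattern \\s(\\w+)$ matches a whitespace character, then one or more word characters, then the end of the string; in other words, it matches the last word of the string. The leading ^.* consumes everything before it, and the whole match is replaced with capture group 1 using \\1. Note that an entry with only one word contains no whitespace, so the pattern does not match and gsub returns that entry unchanged rather than NA.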

Live example:

https://regex101.com/r/toJeTg/1

Upvotes: 1
