Reputation: 165
I'm having a slight issue with extracting a word from a character string. I have a column that holds the taxon name for a species, in the form "Genus species". I was trying to create a new column with just the species. Initially I just used
library(stringr)
count$species <- word(count$taxon_name, 2)
to extract the second word. This worked great until I realized there are a few entries in the taxon_name
column that have a parenthesized word between the genus and species, like so: "Genus (word) species".
To remove that I wrote this code, which worked great at removing the parenthesized word from the entries that had it:
count$new_taxon <- gsub("\\([^()]*\\)", "", count$taxon_name)
and then performed the above on the new column
count$species <- word(count$new_taxon, 2)
That still works on all the ones that haven't been altered, but if an entry had a parenthesis removed it just leaves the entry blank, and doesn't extract anything. I think it might be recognizing the space as a word? I tried changing around whether the column was a factor or character column and it didn't make a difference. Any suggestions?
NOTE: Essentially there are three types of input in the taxon_name column (1) Genus species (2) Genus and (3) Genus (word) species.
When I try anything that extracts the last word, it deals with case (1) and (3) but now it includes (2) which I just want to be NA, because it doesn't have a species.
Upvotes: 1
Views: 1061
Reputation: 76402
Maybe something like the following.
x <- c("Genus species", "Genus", "Genus (word) species")
y <- gsub(".*[[:blank:]](\\w+)$", "\\1", x)
is.na(y) <- y == "Genus"
y
[1] "species" NA "species"
Note that it would be very difficult to search for "species" directly, since we don't have a full list of them. That's why I've opted for this approach: set the elements of the result y to NA if they are equal to "Genus".
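If the genus names in the real data vary, comparing against the literal "Genus" will not catch them. A minimal sketch of one alternative, assuming a genus-only entry is exactly one that contains no blank (and that entries have no leading or trailing whitespace), is to set NA wherever the input has no blank at all:

```r
x <- c("Genus species", "Genus", "Genus (word) species")

# Keep only the last word when it is preceded by a blank;
# strings without a blank do not match and come back unchanged.
y <- sub(".*[[:blank:]](\\w+)$", "\\1", x)

# Assumption: an entry with no blank is a bare genus, so it has no species.
is.na(y) <- !grepl("[[:blank:]]", x)
y
```

This keeps the same regex and only changes the rule that decides which results become NA.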
Upvotes: 1
Reputation: 7292
Assuming "species" is never multiple words, you can do it like this:
count$species <- gsub("^.*\\s(\\w+)$", "\\1", count$taxon_name)
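The pattern \\s(\\w+)$
matches a whitespace character, then one or more word characters, then the end of the string; in other words, it matches the last word of the string. The leading ^.* consumes everything before it, and we then replace the whole match with capture group 1 using \\1. Note that an entry with no whitespace, such as a bare "Genus", does not match and is returned unchanged.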
Live example:
https://regex101.com/r/toJeTg/1
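As for why the original word() call came back blank: removing only "(word)" leaves two consecutive spaces ("Genus  species"), and splitting on a single space then yields an empty second piece. The following is a base-R sketch, not a definitive fix, that removes the parenthetical together with the whitespace in front of it and returns NA for genus-only entries. The vector taxon stands in for count$taxon_name and is hypothetical example data.

```r
# Hypothetical stand-in for count$taxon_name
taxon <- c("Genus species", "Genus", "Genus (word) species")

# Drop the parenthetical along with any whitespace before it,
# so no double space is left behind.
cleaned <- gsub("\\s*\\([^()]*\\)", "", taxon)

# Split on runs of whitespace and take the second piece if there is one.
parts <- strsplit(cleaned, "\\s+")
species <- vapply(
  parts,
  function(p) if (length(p) >= 2) p[2] else NA_character_,
  character(1)
)
species
```

With a data frame, the result could be assigned back with count$species <- species after running the same steps on count$taxon_name.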
Upvotes: 1