Reputation: 33
I have a vector of character strings, where each string is a comma-separated list of species names (i.e. Genus species). Each string can contain a variable number of species (in the example below, between 1 and 3).
trees <- c("Erythrina poeppigiana", "Erythrina poeppigiana, Juglans regia x Juglans nigra", "Erythrina poeppigiana, Juglans regia x Juglans nigra, Chloroleucon eurycyclum")
I wish to obtain a vector of character strings of the same length, where each string is a comma-separated list of only the genus portions of the names:
genera <- c("Erythrina", "Erythrina, Juglans", "Erythrina, Juglans, Chloroleucon")
The awkward case is the hybrid species "Juglans regia x Juglans nigra". It should come out as just "Juglans", because it sits entirely between two commas and is therefore a single species. In hybrids like this, the genus is always the same on both sides of the "x", so taking the first word of that portion of the string is fine, just as in the standard cases. However, solutions that pull out "every other word" won't work because of these hybrids.
My attempt was to first strsplit by ", " to separate out the individual species names, then strsplit again by " " to separate out the genus names:
split.list <- sapply(strsplit(trees, split=", "), strsplit, 1, split=" ")
split.list
[[1]]
[[1]][[1]]
[1] "Erythrina" "poeppigiana"
[[2]]
[[2]][[1]]
[1] "Erythrina" "poeppigiana"
[[2]][[2]]
[1] "Juglans" "regia" "x" "Juglans" "nigra"
[[3]]
[[3]][[1]]
[1] "Erythrina" "poeppigiana"
[[3]][[2]]
[1] "Juglans" "regia" "x" "Juglans" "nigra"
[[3]][[3]]
[1] "Chloroleucon" "eurycyclum"
But then the indexing to pull out the genus names and recombine is quite complicated (and I can't even figure it out!). Is there a cleaner solution for an ordered split and recombination?
It would also be acceptable to leverage the fact that genus names are the only capitalized words in all of the strings. Maybe a regex that pulls out just the words starting with a capital letter?
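One possible sketch of the capital-letter idea, for illustration only: keep each capitalized word and drop everything after it up to the next comma. It assumes genus names are plain ASCII letters and are always the first word of each comma-separated entry; the pattern is an assumption, not something taken from an answer.

```r
trees <- c("Erythrina poeppigiana",
           "Erythrina poeppigiana, Juglans regia x Juglans nigra",
           "Erythrina poeppigiana, Juglans regia x Juglans nigra, Chloroleucon eurycyclum")

# Capture a capitalized word, then swallow the rest of that entry
# (everything that is not a comma), and keep only the captured word.
genera <- gsub("([A-Z][a-z]+)[^,]*", "\\1", trees)

identical(genera, c("Erythrina", "Erythrina, Juglans", "Erythrina, Juglans, Chloroleucon"))
# TRUE
```

Because the rest of each entry is consumed up to the next comma, the second "Juglans" in the hybrid is never matched on its own.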
Upvotes: 0
Views: 172
Reputation: 51582
Here is an idea using base R:
sapply(strsplit(trees, ' '), function(i) toString(i[c(TRUE, FALSE)]))
#[1] "Erythrina" "Erythrina, Terminalia" "Erythrina, Terminalia, Chloroleucon"
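A small sketch of why this every-other-word indexing only suits entries of exactly two words. It runs the same steps on a single hybrid string; the names hybrid and words are introduced here purely for illustration.

```r
hybrid <- "Erythrina poeppigiana, Juglans regia x Juglans nigra"

# Splitting on spaces gives 7 tokens; the hybrid contributes 5 of them.
words <- strsplit(hybrid, " ")[[1]]

# c(TRUE, FALSE) is recycled, so tokens 1, 3, 5 and 7 are kept.
res <- toString(words[c(TRUE, FALSE)])
res
# "Erythrina, Juglans, x, nigra"
```

The "x" and "nigra" tokens slip through because the hybrid entry has an odd number of words, which shifts the alternating pattern.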
EDIT
Further to your comment, for the new trees, you can simply do:
sapply(strsplit(trees, ', '), function(i) toString(sub('\\s+.*', '', i)))
#[1] "Erythrina, Juglans" "Erythrina"
#[3] "Erythrina, Juglans, Chloroleucon"
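The one-liner above can be unpacked step by step. This is a sketch using the trees vector exactly as defined in the question; the intermediate name species, and the use of vapply in place of sapply, are choices made here for illustration.

```r
trees <- c("Erythrina poeppigiana",
           "Erythrina poeppigiana, Juglans regia x Juglans nigra",
           "Erythrina poeppigiana, Juglans regia x Juglans nigra, Chloroleucon eurycyclum")

# 1. Split each string into its species entries on ", ".
species <- strsplit(trees, ", ", fixed = TRUE)

# 2. In each entry, delete everything from the first run of whitespace onward,
#    which leaves only the first word (the genus), even for hybrids.
# 3. Glue the genera of each string back together with toString().
genera <- vapply(species,
                 function(s) toString(sub("\\s+.*", "", s)),
                 character(1))

identical(genera, c("Erythrina", "Erythrina, Juglans", "Erythrina, Juglans, Chloroleucon"))
# TRUE
```

Splitting on ", " first is what makes the hybrid safe: "Juglans regia x Juglans nigra" is handled as one entry, so only its first word survives.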
Upvotes: 2