Reputation: 25
I have a dataset of species names in which some of the names originally used are now obsolete. Those cells are written as "old_species***retired***use new_species", whereas correct cells contain just "new_species".
Here is a sample of the data:
df<- data.frame(species=c("Etheostoma spectabile","Ictalurus furcatus","Micropterus salmoides","Micropterus salmoides","Ictalurus punctatus","Ictalurus punctatus","Ictalurus punctatus","Micropterus salmoides","Etheostoma olmstedi","Noturus insignis","Lepomis auritus","Lepomis auritus","Nocomis leptocephalus","Scartomyzon rupiscartes***retired***use Moxostoma rupiscartes","Lepomis cyanellus","Notropis chlorocephalus","Scartomyzon cervinus***retired***use Moxostoma cervinum","Ictalurus punctatus","Lythrurus ardens","Moxostoma pappillosum","Micropterus salmoides","Micropterus salmoides","Ictalurus punctatus"))
I have tried
sapply(strsplit(df$species, split='***retired***use', fixed = TRUE), function(x) x[2])
but the cells for which the data is already correct return NA because they do not contain the split string.
Is there a way to make the split just for the cells actually containing it?
Upvotes: 0
Views: 28
Reputation: 21400
You can change the old names to the new names using gsub plus a backreference:
gsub(".*\\*\\*\\*retired\\*\\*\\*use\\s(.*)", "\\1", df$species)
# [1] "Etheostoma spectabile" "Ictalurus furcatus" "Micropterus salmoides" "Micropterus salmoides"
# [5] "Ictalurus punctatus" "Ictalurus punctatus" "Ictalurus punctatus" "Micropterus salmoides"
# [9] "Etheostoma olmstedi" "Noturus insignis" "Lepomis auritus" "Lepomis auritus"
# [13] "Nocomis leptocephalus" "Moxostoma rupiscartes" "Lepomis cyanellus" "Notropis chlorocephalus"
# [17] "Moxostoma cervinum" "Ictalurus punctatus" "Lythrurus ardens" "Moxostoma pappillosum"
# [21] "Micropterus salmoides" "Micropterus salmoides" "Ictalurus punctatus"
Explanation:
.* matches anything, any number of times, followed by ...
\\*\\*\\*retired\\*\\*\\*use\\s ... the literal text ***retired***use and one whitespace character, followed by ...
(.*) ... anything, any number of times. This is the capturing group that the backreference \\1 in the replacement argument of gsub refers to.
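To see why names that are already correct pass through untouched, here is a minimal sketch on a made-up two-element vector (x, pattern and result are names introduced only for this illustration). When the pattern does not match a string, gsub returns that string unchanged.

```r
# Hypothetical input: one retired name and one current name
x <- c("Scartomyzon cervinus***retired***use Moxostoma cervinum",
       "Lepomis auritus")

# Same pattern as above: everything up to the marker, then capture the rest
pattern <- ".*\\*\\*\\*retired\\*\\*\\*use\\s(.*)"

# Only the first element contains the marker
grepl(pattern, x)

# The matched string is replaced by its capture group;
# the unmatched string is returned as it is
result <- gsub(pattern, "\\1", x)
result
# [1] "Moxostoma cervinum" "Lepomis auritus"
```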
Upvotes: 1
Reputation: 887048
We can create an index with grep and then split only those rows:
i1 <- grep('retired', df$species)
df$species <- as.character(df$species)
df$species[i1] <- sapply(strsplit(df$species[i1], "***retired***use ",
fixed = TRUE), `[`, 2)
df$species
#[1] "Etheostoma spectabile" "Ictalurus furcatus" "Micropterus salmoides" "Micropterus salmoides" "Ictalurus punctatus"
#[6] "Ictalurus punctatus" "Ictalurus punctatus" "Micropterus salmoides" "Etheostoma olmstedi" "Noturus insignis"
#[11] "Lepomis auritus" "Lepomis auritus" "Nocomis leptocephalus" "Moxostoma rupiscartes" "Lepomis cyanellus"
#[16] "Notropis chlorocephalus" "Moxostoma cervinum" "Ictalurus punctatus" "Lythrurus ardens" "Moxostoma pappillosum"
#[21] "Micropterus salmoides" "Micropterus salmoides" "Ictalurus punctatus"
Or use a regex with sub to drop everything up to and including the marker:
df$species <- sub(".*\\*{3}retired\\*{3}use\\s+", "", df$species)
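If you would rather stay close to the original strsplit attempt, one possible variant (a sketch, not one of the methods above) is to keep the last piece of each split instead of the second. A name without the marker splits into a single piece, so its last piece is the name itself. The helper last_piece and the small species vector are hypothetical, and this assumes the column contains no empty strings.

```r
# Small hypothetical sample: one retired name and two current names
species <- c("Scartomyzon rupiscartes***retired***use Moxostoma rupiscartes",
             "Lepomis cyanellus",
             "Ictalurus punctatus")

# Hypothetical helper: return the last element of a character vector
last_piece <- function(p) p[length(p)]

# as.character() guards against the column being a factor
pieces <- strsplit(as.character(species), "***retired***use ", fixed = TRUE)

# vapply() enforces that each result is a single string
cleaned <- vapply(pieces, last_piece, character(1))
cleaned
# [1] "Moxostoma rupiscartes" "Lepomis cyanellus"     "Ictalurus punctatus"
```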
Upvotes: 0