Reputation: 23
I'm new to R and am trying to use it to shorten each header of a spreadsheet to a single word. For example:
Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Lachnospiraceae(100);unclassified(100);
Bacteria(100);Tenericutes(100);Mollicutes(100);Mollicutes_RF9(100);unclassified(100);unclassified(100);
So I would like to shorten each taxon string to a single word without the numbers, such as Clostridia and Mollicutes. I think it can be done, but I can't figure out how.
Thanks.
Upvotes: 1
Views: 149
Reputation: 51592
Is this what you need? Or did I completely misunderstand?
gsub('\\(.*\\)', '', unlist(strsplit(x, ';'))[3])
#[1] "Clostridia"
where x is your column name.
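To handle several headers at once, the same split-then-strip idea can be wrapped in a small function. This is only a minimal sketch: `extract_taxon` and `headers` are hypothetical names that do not appear in the question, and it assumes the headers are already available as a character vector.

```r
# Hypothetical helper: return the n-th taxon of each header,
# with the "(100)" score removed
extract_taxon <- function(headers, n = 3) {
  # Split each header on ";" (a trailing ";" produces no extra element)
  parts <- strsplit(headers, ";", fixed = TRUE)
  # Pick the n-th piece of every header; NA if a header is too short
  picked <- vapply(parts, function(p) p[n], character(1))
  # Drop everything from the first "(" onwards
  sub("\\(.*", "", picked)
}

headers <- c(
  "Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Lachnospiraceae(100);unclassified(100);",
  "Bacteria(100);Tenericutes(100);Mollicutes(100);Mollicutes_RF9(100);unclassified(100);unclassified(100);"
)
extract_taxon(headers)
#[1] "Clostridia" "Mollicutes"
```

Changing `n` selects a different rank, for example `extract_taxon(headers, 2)` for the phylum.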
Upvotes: 0
Reputation: 887108
We can use sub
sub("\\(.*", "", "Firmicutes(100)")
Suppose we read the data into R using read.csv/read.table with check.names=FALSE; then we apply the same code to the column names:
colnames(data) <- sub("\\(.*", "", colnames(data))
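The check.names=FALSE part matters because, by default, read.csv rewrites headers into syntactically valid R names, so the parentheses would be turned into dots before sub ever sees them. A small sketch of the whole round trip, assuming each header holds a single taxon with its score; the table in `csv_text` is made up for illustration:

```r
# Hypothetical two-column table; each header is one taxon plus its score
csv_text <- 'sample,Firmicutes(100),Tenericutes(100)\nA,12,3\nB,7,9'

# Keep the headers exactly as written in the file
data <- read.csv(text = csv_text, check.names = FALSE)

# Remove everything from the first "(" onwards in every column name
colnames(data) <- sub("\\(.*", "", colnames(data))
colnames(data)
# the column names are now: sample, Firmicutes, Tenericutes
```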
If it is a single string, we can extract every name with str_extract_all:
library(stringr)
str1 <- "Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Lachnospiraceae(100);unclassified(100)"
str_extract_all(str1, "[^()0-9;]+")[[1]]
#[1] "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Lachnospiraceae"
#[6] "unclassified"
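One caveat: because the character class above excludes digits, a name such as "Mollicutes_RF9" from the second header in the question would lose its trailing 9. A possible base R alternative, sketched here under the assumption that every name is immediately followed by "(", keeps such names intact; `str2` and `taxa` are names introduced only for this example:

```r
# Second header from the question, which has a digit inside a name
str2 <- "Bacteria(100);Tenericutes(100);Mollicutes(100);Mollicutes_RF9(100);unclassified(100);unclassified(100);"

# Match every run of characters that is directly followed by "(";
# the lookahead (?=\\() requires perl = TRUE
taxa <- regmatches(str2, gregexpr("[^;(]+(?=\\()", str2, perl = TRUE))[[1]]
taxa[4]
#[1] "Mollicutes_RF9"
```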
Suppose we need to extract only the third word, i.e. "Clostridia":
sub("^([^(]+[(][^;]+;){2}(\\w+).*", "\\2", str1)
#[1] "Clostridia"
Upvotes: 1
Reputation: 2250
Using only base commands, the names can be extracted with this code:
nam <- c("Bacteria(100);Tenericutes(100);Mollicutes(100);Mollicutes_RF9(100);unclassified(100);unclassified(100);")
nam <- strsplit(nam, ";")[[1]]
# sub() is vectorised, so it can be applied to the whole vector directly
nam <- sub("\\(.*", "", nam)
nam
[1] "Bacteria" "Tenericutes" "Mollicutes" "Mollicutes_RF9" "unclassified" "unclassified"
Upvotes: 0