Soon Hwee N

Reputation: 23

R: truncate strings to a word

I'm new to R and am trying to use it to shorten the headers of a spreadsheet to a single word. For example:

Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Lachnospiraceae(100);unclassified(100);

Bacteria(100);Tenericutes(100);Mollicutes(100);Mollicutes_RF9(100);unclassified(100);unclassified(100);

I would like to shorten each taxon string to a single word without the numbers, such as Clostridia and Mollicutes. I think it can be done, but I can't figure out how.

Thanks.

Upvotes: 1

Views: 149

Answers (3)

Sotos

Reputation: 51592

Is this what you need? Or did I completely misunderstand?

gsub('\\(.*\\)', '', unlist(strsplit(x, ';'))[3])
#[1] "Clostridia"

where x is a single column name (one character string).
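If there are several headers rather than one, the same idea can be applied to each of them. The following is only a sketch in base R: `headers` holds the two example strings from the question, and `nth_taxon` is a hypothetical helper name introduced here for illustration, not something from the answer above.

```r
# Two example headers taken from the question
headers <- c(
  "Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Lachnospiraceae(100);unclassified(100);",
  "Bacteria(100);Tenericutes(100);Mollicutes(100);Mollicutes_RF9(100);unclassified(100);unclassified(100);"
)

# Hypothetical helper: take the n-th ';'-separated field of each header
# and drop the "(...)" part, as in the one-liner above
nth_taxon <- function(x, n = 3) {
  fields <- strsplit(x, ";", fixed = TRUE)
  vapply(fields, function(f) gsub("\\(.*\\)", "", f[n]), character(1))
}

third <- nth_taxon(headers)
third
#[1] "Clostridia" "Mollicutes"
```

If a header has fewer than `n` fields, this sketch returns `NA` for it rather than raising an error.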

Upvotes: 0

akrun

Reputation: 887108

We can use sub

sub("\\(.*", "", "Firmicutes(100)")

Suppose we read the data into R using read.csv/read.table with check.names=FALSE; we can then apply the same code to the column names:

colnames(data) <- sub("\\(.*", "", colnames(data))
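To illustrate why check.names=FALSE matters here, a small sketch with made-up data: `csv_text` and its two columns are assumptions for the example, not the asker's file. With the default check.names=TRUE, R would pass the headers through make.names, which replaces the parentheses with dots before `sub` ever sees them.

```r
# Made-up two-column CSV, used only for illustration
csv_text <- "Firmicutes(100),Tenericutes(100)\n1,2\n3,4"

# Default behaviour: headers are altered to be syntactically valid names
mangled <- colnames(read.csv(text = csv_text))
mangled
#[1] "Firmicutes.100."  "Tenericutes.100."

# With check.names = FALSE the headers are kept as written
data <- read.csv(text = csv_text, check.names = FALSE)
colnames(data) <- sub("\\(.*", "", colnames(data))
colnames(data)
#[1] "Firmicutes"  "Tenericutes"
```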

If it is a single string

library(stringr)
str1 <- "Bacteria(100);Firmicutes(100);Clostridia(100);Clostridiales(100);Lachnospiraceae(100);unclassified(100)"

str_extract_all(str1, "[^()0-9;]+")[[1]]
#[1] "Bacteria"        "Firmicutes"      "Clostridia"      "Clostridiales"   "Lachnospiraceae"
#[6] "unclassified"   
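One thing to watch: the character class `[^()0-9;]` also excludes digits, so a name that itself contains a digit, such as Mollicutes_RF9 in the question's second header, would lose its trailing 9. A possible alternative, sketched here in base R with a lookahead so it does not depend on stringr, is to match everything up to an opening parenthesis. `str2` and `taxa` are names chosen for this example.

```r
# Second header from the question; one taxon name contains a digit
str2 <- "Bacteria(100);Tenericutes(100);Mollicutes(100);Mollicutes_RF9(100);unclassified(100);unclassified(100);"

# Match runs of characters that are not ';' or '(' and are
# immediately followed by '(' (lookahead needs perl = TRUE)
taxa <- regmatches(str2, gregexpr("[^;(]+(?=\\()", str2, perl = TRUE))[[1]]
taxa
#[1] "Bacteria"       "Tenericutes"    "Mollicutes"     "Mollicutes_RF9"
#[5] "unclassified"   "unclassified"
```

This assumes every taxon name is followed by a parenthesised number, as in the examples shown.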

Update

Suppose we need to extract only the third word, i.e. "Clostridia":

sub("^([^(]+[(][^;]+;){2}(\\w+).*", "\\2", str1)
#[1] "Clostridia"

Upvotes: 1

nya

Reputation: 2250

Using only base commands, the names can be extracted with this code:

nam <- c("Bacteria(100);Tenericutes(100);Mollicutes(100);Mollicutes_RF9(100);unclassified(100);unclassified(100);")
nam <- strsplit(nam, ";")[[1]]
nam <- unname(sapply(nam, FUN=function(x) sub("\\(.*", "", x)))

nam
#[1] "Bacteria"       "Tenericutes"    "Mollicutes"     "Mollicutes_RF9" "unclassified"   "unclassified"

Upvotes: 0
