Angus Campbell

Reputation: 592

most efficient way to make a new vector from an old vector in R

I am new to R and have written a piece of code that I think could be done better. I am trying to convert a character vector into a numeric vector.

> dge$samples$Developmental.stage
 [1] "Amastigote 4 hpi" "Amastigote 4 hpi" "Amastigote 4 hpi" "Amastigote 4 hpi" "Amastigote 6hr"  
 [6] "Amastigote 6hr"   "Amastigote 6hr"   "Amastigote 6hr"   "Amastigote 12hr"  "Amastigote 12hr" 
[11] "Amastigote 12hr"  "Amastigote 24hr"  "Amastigote 24hr"  "Amastigote 24hr"  "Amastigote 24hr" 
[16] "Amastigote 48hr"  "Amastigote 48hr"  "Amastigote 48hr"  "Amastigote 48hr"  "Amastigote 48hr" 
[21] "Amastigote 48hr"  "Amastigote 48hr"  "Amastigote 72hr"  "Amastigote 72hr"  "Amastigote 72hr" 
[26] "Amastigote 72hr"  "Amastigote 72hr"

It needs to be converted into a numeric vector of the corresponding hours post infection (hpi).

hpi <- rep(NA, length(dge$samples$Developmental.stage)) # empty vector to be filled by the loop
for (i in 1:length(hpi)) {
  if (dge$samples$Developmental.stage[i] == "Amastigote 4 hpi") {
    hpi[i] = 4
  } else if (dge$samples$Developmental.stage[i] == "Amastigote 6hr") {
    hpi[i] = 6
  } else if (dge$samples$Developmental.stage[i] == "Amastigote 12hr") {
    hpi[i] = 12
  } else if (dge$samples$Developmental.stage[i] == "Amastigote 24hr") {
    hpi[i] = 24
  } else if (dge$samples$Developmental.stage[i] == "Amastigote 48hr") {
    hpi[i] = 48
  } else if (dge$samples$Developmental.stage[i] == "Amastigote 72hr") {
    hpi[i] = 72
  }
} # end of for loop

My code works but I feel like there are better ways to do this.

Upvotes: 0

Views: 112

Answers (2)

LMc

Reputation: 18642

I would recommend the answer from @akrun, but as an alternative:

as.numeric(gsub("\\D", "", dge$samples$Developmental.stage))

This deletes anything that's not a digit from your string.

From the base R documentation for regex:

Symbols \d, \s, \D and \S denote the digit and space classes and their negations (these are all extensions).
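
One caveat is worth a quick illustration: because the pattern deletes every non-digit, a label containing more than one group of digits has them run together. A small sketch, using made-up labels that are not part of the question's data:

```r
# Hypothetical labels, invented for illustration; only the first follows the question's format
x <- c("Amastigote 12hr", "Replicate 2 Amastigote 12hr", "Trypomastigote")

# Deleting every non-digit concatenates all digit groups in a string
digits_only <- gsub("\\D", "", x)
digits_only
# "12"  "212" ""

# A string with no digits becomes "", which as.numeric() converts to NA
as.numeric(digits_only)
# 12 212 NA
```

For labels like the ones in the question, which contain a single number each, this is not a problem.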


Benchmarking

An alternative, if you want something fast but strict (just as strict as your loop), would be to use a named vector. In the code below, v <- dge$samples$Developmental.stage:

lookup <- c("Amastigote 4 hpi" = 4,
            "Amastigote 6hr" = 6,
            "Amastigote 12hr" = 12,
            "Amastigote 24hr" = 24,
            "Amastigote 48hr" = 48,
            "Amastigote 72hr" = 72)

unname(lookup[v])
 [1]  4  4  4  4  6  6  6  6 12 12 12 24 24 24 24 48 48 48 48 48 48 48 72 72 72 72 72

Comparing these four options (I stuck your option in a function named loop) on my computer gave the following results:

library(microbenchmark)

microbenchmark(
  readr::parse_number(v),
  as.numeric(gsub("\\D", "", v)),
  loop(),
  unname(lookup[v]),
  times = 1000L
)

Unit: microseconds
                             expr    min     lq      mean  median      uq     max neval  cld
           readr::parse_number(v) 39.100 42.001 45.987904 44.4010 46.7020 162.901  1000   c 
 as.numeric(gsub("\\\\D", "", v)) 85.701 87.601 90.999841 90.3015 91.8005 179.001  1000    d
                           loop() 29.501 30.601 32.444417 31.9010 32.9000  92.401  1000  b  
                unname(lookup[v])  1.801  2.501  3.410506  3.4010  4.0000  33.301  1000 a
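
The loop function benchmarked above is not shown. A plausible reconstruction, assuming it simply wraps the question's if/else chain and reads v from the enclosing environment, might look like the sketch below. The function body and the shortened v are assumptions for illustration, not the code that was actually timed:

```r
# Shortened example input: one of each label from the question
v <- c("Amastigote 4 hpi", "Amastigote 6hr", "Amastigote 12hr",
       "Amastigote 24hr", "Amastigote 48hr", "Amastigote 72hr")

# Hypothetical reconstruction of loop(): the if/else chain from the question,
# wrapped in a function that returns the filled vector
loop <- function() {
  hpi <- rep(NA, length(v))
  for (i in seq_along(hpi)) {
    if (v[i] == "Amastigote 4 hpi") {
      hpi[i] <- 4
    } else if (v[i] == "Amastigote 6hr") {
      hpi[i] <- 6
    } else if (v[i] == "Amastigote 12hr") {
      hpi[i] <- 12
    } else if (v[i] == "Amastigote 24hr") {
      hpi[i] <- 24
    } else if (v[i] == "Amastigote 48hr") {
      hpi[i] <- 48
    } else if (v[i] == "Amastigote 72hr") {
      hpi[i] <- 72
    }
  }
  hpi
}

loop()
# 4 6 12 24 48 72
```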

Note that the loop and the named vector are both strict and not very flexible: a label with an extra space or a misspelling would not be matched. They are also the two fastest options, because they compare strings directly instead of parsing them.
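
To make the strictness concrete, here is a minimal sketch with deliberately messy, made-up labels. Names that are not in the lookup come back as NA, so it is probably worth checking for them before using the result:

```r
lookup <- c("Amastigote 4 hpi" = 4, "Amastigote 6hr" = 6)

# Invented input: the second label has a stray double space, the third is misspelled
v_messy <- c("Amastigote 6hr", "Amastigote  6hr", "Amastigot 4 hpi")

res <- unname(lookup[v_messy])
res
# 6 NA NA

# List the labels that failed to match
unmatched <- unique(v_messy[is.na(res)])
unmatched
# "Amastigote  6hr" "Amastigot 4 hpi"
```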

parse_number is readable, not terribly slow, and more flexible by comparison.

If you were worried about speed and know your data very well then the named vector might not be a bad option.


Lastly, you could create the named vector using parse_number and then apply it, which would give some flexibility. However, there is some overhead in setting up the named vector this way:

u <- unique(v)
lookup <- setNames(readr::parse_number(u), u) 
unname(lookup[v])

With very large vectors this method is slightly faster than parse_number, but they're both far superior to the other options.
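
If you want to try this on a larger input yourself, something like the following sketch could serve as a starting point. It swaps parse_number for the gsub approach so that it runs in base R without readr, and the large vector is simulated by sampling the six labels; the size and seed are arbitrary choices. It only checks that the two routes agree, so wrap the calls in microbenchmark to compare timings:

```r
set.seed(1)  # arbitrary seed so the sample is reproducible
labels <- c("Amastigote 4 hpi", "Amastigote 6hr", "Amastigote 12hr",
            "Amastigote 24hr", "Amastigote 48hr", "Amastigote 72hr")
big <- sample(labels, 1e5, replace = TRUE)  # simulated large input

# Parse each distinct label once, then expand by name lookup
u <- unique(big)
lookup <- setNames(as.numeric(gsub("\\D", "", u)), u)
via_lookup <- unname(lookup[big])

# Parsing every element directly gives the same values
direct <- as.numeric(gsub("\\D", "", big))
identical(via_lookup, direct)
# TRUE
```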

Upvotes: 3

akrun

Reputation: 887118

Based on the input data, we could parse the numeric part with parse_number from readr:

readr::parse_number(dge$samples$Developmental.stage)
#[1]  4  4  4  4  6  6  6  6 12 12 12 24 24 24 24 48 48 48 48 48 48 48 72 72 72 72 72
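
If readr is not available, a rough base R approximation is to capture the first run of digits with sub. This is only a sketch for non-negative whole numbers such as the ones here; it does not reproduce parse_number's handling of decimals, signs, or grouping marks. The third label is made up to show that only the first number is kept:

```r
x <- c("Amastigote 4 hpi", "Amastigote 12hr", "Replicate 2 Amastigote 12hr")

# Skip leading non-digits, capture the first run of digits, drop the rest
first_number <- as.numeric(sub("^\\D*(\\d+).*$", "\\1", x))
first_number
# 4 12 2
```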

data

# Minimal stand-in for the asker's object: a nested list, so that
# dge$samples$Developmental.stage resolves to the character vector below
dge <- list(samples = list(Developmental.stage =
c("Amastigote 4 hpi", "Amastigote 4 hpi", "Amastigote 4 hpi", 
"Amastigote 4 hpi", "Amastigote 6hr", "Amastigote 6hr", "Amastigote 6hr", 
"Amastigote 6hr", "Amastigote 12hr", "Amastigote 12hr", "Amastigote 12hr", 
"Amastigote 24hr", "Amastigote 24hr", "Amastigote 24hr", "Amastigote 24hr", 
"Amastigote 48hr", "Amastigote 48hr", "Amastigote 48hr", "Amastigote 48hr", 
"Amastigote 48hr", "Amastigote 48hr", "Amastigote 48hr", "Amastigote 72hr", 
"Amastigote 72hr", "Amastigote 72hr", "Amastigote 72hr", "Amastigote 72hr"
)))

Upvotes: 2
