Reputation: 592
I am new to R and have written a bit of code that I think could be done better. I am trying to convert a character vector into a numeric one.
> dge$samples$Developmental.stage
[1] "Amastigote 4 hpi" "Amastigote 4 hpi" "Amastigote 4 hpi" "Amastigote 4 hpi" "Amastigote 6hr"
[6] "Amastigote 6hr" "Amastigote 6hr" "Amastigote 6hr" "Amastigote 12hr" "Amastigote 12hr"
[11] "Amastigote 12hr" "Amastigote 24hr" "Amastigote 24hr" "Amastigote 24hr" "Amastigote 24hr"
[16] "Amastigote 48hr" "Amastigote 48hr" "Amastigote 48hr" "Amastigote 48hr" "Amastigote 48hr"
[21] "Amastigote 48hr" "Amastigote 48hr" "Amastigote 72hr" "Amastigote 72hr" "Amastigote 72hr"
[26] "Amastigote 72hr" "Amastigote 72hr"
It needs to be converted into a vector of numeric values corresponding to the hours post infection (hpi).
hpi <- rep(NA, length(dge$samples$Developmental.stage)) #empty vector to be filled by loop
for(i in 1:length(hpi)) {
if(dge$samples$Developmental.stage[i] == "Amastigote 4 hpi"){
hpi[i] = 4
} else if(dge$samples$Developmental.stage[i] == "Amastigote 6hr"){
hpi[i] = 6
} else if(dge$samples$Developmental.stage[i] == "Amastigote 12hr"){
hpi[i] = 12
} else if(dge$samples$Developmental.stage[i] == "Amastigote 24hr"){
hpi[i] = 24
}else if(dge$samples$Developmental.stage[i] == "Amastigote 48hr"){
hpi[i] = 48
} else if(dge$samples$Developmental.stage[i] == "Amastigote 72hr"){
hpi[i] = 72
}
} #end of for loop
My code works but I feel like there are better ways to do this.
Upvotes: 0
Views: 112
Reputation: 18642
I would recommend the answer from @akrun, but as an alternative:
as.numeric(gsub("\\D", "", dge$samples$Developmental.stage))
This deletes anything that's not a digit from your string.
From the base R documentation for regex:
Symbols \d, \s, \D and \S denote the digit and space classes and their negations (these are all extensions).
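As a quick check of what this pattern does, here is a minimal sketch on a few made-up strings (the vector x is hypothetical, not your data). Because every non-digit character is deleted, separate runs of digits in the same string get glued together, so this only works when each label contains a single number:

```r
# Hypothetical example strings; the last one contains two separate numbers
x <- c("Amastigote 4 hpi", "Amastigote 12hr", "Stage 2 at 12hr")

# Remove every non-digit character, then convert what is left to numeric
digits_only <- gsub("\\D", "", x)
res <- as.numeric(digits_only)

print(digits_only)  # "4" "12" "212"
print(res)          # 4 12 212
```

For the labels in the question this is not a problem, since each one contains exactly one number.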
Benchmarking
An alternative, if you wanted something fast but strict (just as strict as your loop), would be to use a named vector. In the code below, v <- dge$samples$Developmental.stage:
lookup <- c("Amastigote 4 hpi" = 4,
"Amastigote 6hr" = 6,
"Amastigote 12hr" = 12,
"Amastigote 24hr" = 24,
"Amastigote 48hr" = 48,
"Amastigote 72hr" = 72)
unname(lookup[v])
[1] 4 4 4 4 6 6 6 6 12 12 12 24 24 24 24 48 48 48 48 48 48 48 72 72 72 72 72
Comparing these four options (I put your option in a function named loop) on my computer gave the following results:
library(microbenchmark)
microbenchmark(
readr::parse_number(v),
as.numeric(gsub("\\D", "", v)),
loop(),
unname(lookup[v]),
times = 1000L
)
Unit: microseconds
expr min lq mean median uq max neval cld
readr::parse_number(v) 39.100 42.001 45.987904 44.4010 46.7020 162.901 1000 c
as.numeric(gsub("\\\\D", "", v)) 85.701 87.601 90.999841 90.3015 91.8005 179.001 1000 d
loop() 29.501 30.601 32.444417 31.9010 32.9000 92.401 1000 b
unname(lookup[v]) 1.801 2.501 3.410506 3.4010 4.0000 33.301 1000 a
I would mention that looping and the named vector are both very strict and not very flexible. If there was an extra space or a misspelling then it would not be matched. These two options are both the fastest because of the direct comparison without having to parse a string.
parse_number is readable, not terribly slow, and more flexible by comparison.
If you were worried about speed and know your data very well then the named vector might not be a bad option.
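To make the strictness trade-off concrete, here is a small sketch with deliberately messy input (v_messy is hypothetical, not taken from your data). Labels that do not exactly match a name in the lookup vector come back as NA, which at least makes bad labels easy to spot, while the regex approach still pulls the number out:

```r
# Lookup table as defined earlier in this answer
lookup <- c("Amastigote 4 hpi" = 4, "Amastigote 6hr" = 6, "Amastigote 12hr" = 12,
            "Amastigote 24hr" = 24, "Amastigote 48hr" = 48, "Amastigote 72hr" = 72)

# Hypothetical input: one clean label, one with an extra space, one misspelled
v_messy <- c("Amastigote 6hr", "Amastigote  6hr", "Amastigot 6hr")

# Strict: character indexing matches names exactly, so non-matches give NA
strict <- unname(lookup[v_messy])
print(strict)     # 6 NA NA

# The NAs can be used to report which labels failed to match
unmatched <- v_messy[is.na(strict)]
print(unmatched)

# Flexible: stripping non-digits does not care about spacing or spelling
flexible <- as.numeric(gsub("\\D", "", v_messy))
print(flexible)   # 6 6 6
```

Whether the silent flexibility or the loud NA is preferable depends on how much you trust the labels.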
Lastly, you could create the named vector using parse_number and then apply it, which would give some flexibility. However, there is some overhead in setting up the named vector this way:
u <- unique(v)
lookup <- setNames(readr::parse_number(u), u)
unname(lookup[v])
With very large vectors this method is slightly faster than parse_number, but they're both far superior to the other options.
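A rough sketch of why parsing only the unique labels can pay off on a large vector is below. The vector v_big is simulated for illustration, and base R's gsub stands in for readr::parse_number so the snippet runs without extra packages. That substitution is my assumption, not what was benchmarked above. The point is that the parsing step runs once per distinct label instead of once per element, and the result is the same:

```r
# Hypothetical large input: the six labels from the question, sampled many times
labels <- c("Amastigote 4 hpi", "Amastigote 6hr", "Amastigote 12hr",
            "Amastigote 24hr", "Amastigote 48hr", "Amastigote 72hr")
set.seed(1)
v_big <- sample(labels, 1e5, replace = TRUE)

# Parse each distinct label only once
# (base R regex used here in place of readr::parse_number)
u <- unique(v_big)
lookup <- setNames(as.numeric(gsub("\\D", "", u)), u)

# Map every element through the small lookup table
via_lookup <- unname(lookup[v_big])

# For comparison: parse every element directly
direct <- as.numeric(gsub("\\D", "", v_big))

# Both routes should give the same numeric vector
print(identical(via_lookup, direct))
```

Timings will vary by machine and by how many distinct labels there are, so it is worth running microbenchmark on your own data before committing to one approach.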
Upvotes: 3
Reputation: 887118
Based on the input data, we could parse the numeric part with parse_number from readr:
readr::parse_number(dge$samples$Developmental.stage)
#[1] 4 4 4 4 6 6 6 6 12 12 12 24 24 24 24 48 48 48 48 48 48 48 72 72 72 72 72
dge$samples$Developmental.stage
c("Amastigote 4 hpi", "Amastigote 4 hpi", "Amastigote 4 hpi",
"Amastigote 4 hpi", "Amastigote 6hr", "Amastigote 6hr", "Amastigote 6hr",
"Amastigote 6hr", "Amastigote 12hr", "Amastigote 12hr", "Amastigote 12hr",
"Amastigote 24hr", "Amastigote 24hr", "Amastigote 24hr", "Amastigote 24hr",
"Amastigote 48hr", "Amastigote 48hr", "Amastigote 48hr", "Amastigote 48hr",
"Amastigote 48hr", "Amastigote 48hr", "Amastigote 48hr", "Amastigote 72hr",
"Amastigote 72hr", "Amastigote 72hr", "Amastigote 72hr", "Amastigote 72hr"
)
Upvotes: 2