How to format the data within my dataset in R?

Question

I have a data frame with some file names and sized and I would like to normalize the sizes for all rows to same unit (Mb):

This is my original data:

  filename  size
1        A 100Kb
2        B 200Kb
3        C  30Kb
4        D   1Mb
5        E  10Mb

This is what I am looking for (normalize the size to Mb):

  filename  size (Mb)
1        A        0.1
2        B        0.2
3        C       0.03
4        D          1
5        E         10

This is my original data frame:

df=rbind(c("3/22/2016", "2:36:41 PM", "3.1Kb", "HiSeqControlSoftware.Options.cfg", "character(0)"),
c("3/22/2016", "2:36:41 PM", "32.7Kb", "Variability_HiSeq_E.bin", "character(0)"),
c("3/22/2016", "2:36:42 PM", "character(0)", "Variability_HiSeq_E.bin", "74"),
c("3/22/2016", "2:36:42 PM", "character(0)", "HiSeqControlSoftware.Options.cfg", "76"),
c("3/22/2016", "2:36:42 PM", "20Kb", "HK7N2CCXX.xml", "character(0)"),
c("3/22/2016", "2:36:42 PM", "character(0)", "HK7N2CCXX.xml", "26"),
c("3/22/2016", "2:36:42 PM", "9.4Kb", "runParameters.xml", "character(0)"))
df = as.data.frame(df)
colnames(df) = c("date","timestamp","filesize","filename","time")

How would I do this?

thank you

akrun · Accepted Answer

We can use gsubfn to do replace the non-numeric substring in the 'size' with '' and /1e3 and then use eval(parse to get the expected output.

library(gsubfn)
unname(sapply(gsubfn('[A-Za-z]+', list(Mb='', Kb = '/1e3'), 
    as.character(df$size)), function(x) eval(parse(text=x))))
#[1]  0.10  0.20  0.03  1.00 10.00

Or with sub from base R by replacing the numeric substring in 'size', match it with a key/value vector (setNames(c(1/1e3, 1), c("Kb", "Mb")) and multiply with the numeric part of 'size' by removing the non-numeric characters with sub (sub("\D+", "", df$size)).

df$size_Mb <- (setNames(c(1/1e3, 1), c("Kb", "Mb")) [sub("\d+", "", 
   df$size)]) * as.numeric(sub("\D+", "", df$size))
df$size_Mb
#[1]  0.10  0.20  0.03  1.00 10.00

Update

For the new dataset

 v1 <- setNames(c(1/1e3, 1), c("Kb", "Mb"))
 v1[sub("[^[:alpha:]]+", "", df$filesize)]*
       as.numeric(sub("[[:alpha:]]+", "", df$filesize))
 #    Kb     Kb           Kb        Kb 
 #0.0031 0.0327     NA     NA 0.0200     NA 0.0094

How to format the data within my dataset in R?

Answers (2)

Related Questions