masber
masber

Reputation: 3067

How to format the data within my dataset in R?

I have a data frame with some file names and sized and I would like to normalize the sizes for all rows to same unit (Mb):

This is my original data:

  filename  size
1        A 100Kb
2        B 200Kb
3        C  30Kb
4        D   1Mb
5        E  10Mb

This is what I am looking for (normalize the size to Mb):

  filename  size (Mb)
1        A        0.1
2        B        0.2
3        C       0.03
4        D          1
5        E         10

This is my original data frame:

df=rbind(c("3/22/2016", "2:36:41 PM", "3.1Kb", "HiSeqControlSoftware.Options.cfg", "character(0)"),
c("3/22/2016", "2:36:41 PM", "32.7Kb", "Variability_HiSeq_E.bin", "character(0)"),
c("3/22/2016", "2:36:42 PM", "character(0)", "Variability_HiSeq_E.bin", "74"),
c("3/22/2016", "2:36:42 PM", "character(0)", "HiSeqControlSoftware.Options.cfg", "76"),
c("3/22/2016", "2:36:42 PM", "20Kb", "HK7N2CCXX.xml", "character(0)"),
c("3/22/2016", "2:36:42 PM", "character(0)", "HK7N2CCXX.xml", "26"),
c("3/22/2016", "2:36:42 PM", "9.4Kb", "runParameters.xml", "character(0)"))
df = as.data.frame(df)
colnames(df) = c("date","timestamp","filesize","filename","time")

How would I do this?

thank you

Upvotes: 0

Views: 54

Answers (2)

akrun
akrun

Reputation: 887118

We can use gsubfn to do replace the non-numeric substring in the 'size' with '' and /1e3 and then use eval(parse to get the expected output.

library(gsubfn)
unname(sapply(gsubfn('[A-Za-z]+', list(Mb='', Kb = '/1e3'), 
    as.character(df$size)), function(x) eval(parse(text=x))))
#[1]  0.10  0.20  0.03  1.00 10.00

Or with sub from base R by replacing the numeric substring in 'size', match it with a key/value vector (setNames(c(1/1e3, 1), c("Kb", "Mb")) and multiply with the numeric part of 'size' by removing the non-numeric characters with sub (sub("\\D+", "", df$size)).

df$size_Mb <- (setNames(c(1/1e3, 1), c("Kb", "Mb")) [sub("\\d+", "", 
   df$size)]) * as.numeric(sub("\\D+", "", df$size))
df$size_Mb
#[1]  0.10  0.20  0.03  1.00 10.00

Update

For the new dataset

 v1 <- setNames(c(1/1e3, 1), c("Kb", "Mb"))
 v1[sub("[^[:alpha:]]+", "", df$filesize)]*
       as.numeric(sub("[[:alpha:]]+", "", df$filesize))
 #    Kb     Kb   <NA>   <NA>     Kb   <NA>     Kb 
 #0.0031 0.0327     NA     NA 0.0200     NA 0.0094 

Upvotes: 1

alistaire
alistaire

Reputation: 43334

There are options, but you need to chop the character strings into the numeric part and the label part. One option which assumes you only have Mb and Kb (but could be extended) is using substr:

df$size <- as.character(df$size)

df$size <- ifelse(substr(df$size, nchar(df$size)-1, nchar(df$size)) == 'Mb',
                  as.numeric(substr(df$size, 1, nchar(df$size)-2)),
                  as.numeric(substr(df$size, 1, nchar(df$size)-2))/1000)

names(df)[2] <- 'size_Mb'
df
#   filename size_Mb
# 1        A    0.10
# 2        B    0.20
# 3        C    0.03
# 4        D    1.00
# 5        E   10.00

tidyr::separate can also replace all the substr by separating size into two columns, simplifying the code a lot:

library(tidyr)
df <- separate(df, size, c('size', 'label'), sep = -3, convert = TRUE)
df$size_Mb <- ifelse(df$label == 'Mb', df$size, df$size/1000)
df <- df[,c('filename', 'size_Mb')]

If you combine tidyr and dplyr, you can do it all in one chain, if you like:

library(tidyr)
library(dplyr)
df %>% separate(size, c('size', 'label'), sep = -3, convert = TRUE) %>% 
    mutate(size = ifelse(label == 'Mb', size, size/1000)) %>% 
    select(-label)

If you'd rather, you can separate with regex with something like

df$label <- sub('\\d+(\\w+)', '\\1', df$size)
df$size <- as.numeric(sub('(\\d+)\\w+', '\\1', df$size))

Upvotes: 2

Related Questions