Reputation: 3067
I have a data frame with some file names and sized and I would like to normalize the sizes for all rows to same unit (Mb):
This is my original data:
filename size
1 A 100Kb
2 B 200Kb
3 C 30Kb
4 D 1Mb
5 E 10Mb
This is what I am looking for (normalize the size to Mb):
filename size (Mb)
1 A 0.1
2 B 0.2
3 C 0.03
4 D 1
5 E 10
This is my original data frame:
df=rbind(c("3/22/2016", "2:36:41 PM", "3.1Kb", "HiSeqControlSoftware.Options.cfg", "character(0)"),
c("3/22/2016", "2:36:41 PM", "32.7Kb", "Variability_HiSeq_E.bin", "character(0)"),
c("3/22/2016", "2:36:42 PM", "character(0)", "Variability_HiSeq_E.bin", "74"),
c("3/22/2016", "2:36:42 PM", "character(0)", "HiSeqControlSoftware.Options.cfg", "76"),
c("3/22/2016", "2:36:42 PM", "20Kb", "HK7N2CCXX.xml", "character(0)"),
c("3/22/2016", "2:36:42 PM", "character(0)", "HK7N2CCXX.xml", "26"),
c("3/22/2016", "2:36:42 PM", "9.4Kb", "runParameters.xml", "character(0)"))
df = as.data.frame(df)
colnames(df) = c("date","timestamp","filesize","filename","time")
How would I do this?
thank you
Upvotes: 0
Views: 54
Reputation: 887118
We can use gsubfn
to do replace the non-numeric substring in the 'size' with ''
and /1e3
and then use eval(parse
to get the expected output.
library(gsubfn)
unname(sapply(gsubfn('[A-Za-z]+', list(Mb='', Kb = '/1e3'),
as.character(df$size)), function(x) eval(parse(text=x))))
#[1] 0.10 0.20 0.03 1.00 10.00
Or with sub
from base R
by replacing the numeric substring in 'size', match it with a key/value vector (setNames(c(1/1e3, 1), c("Kb", "Mb"))
and multiply with the numeric part of 'size' by removing the non-numeric characters with sub
(sub("\\D+", "", df$size)
).
df$size_Mb <- (setNames(c(1/1e3, 1), c("Kb", "Mb")) [sub("\\d+", "",
df$size)]) * as.numeric(sub("\\D+", "", df$size))
df$size_Mb
#[1] 0.10 0.20 0.03 1.00 10.00
Update
For the new dataset
v1 <- setNames(c(1/1e3, 1), c("Kb", "Mb"))
v1[sub("[^[:alpha:]]+", "", df$filesize)]*
as.numeric(sub("[[:alpha:]]+", "", df$filesize))
# Kb Kb <NA> <NA> Kb <NA> Kb
#0.0031 0.0327 NA NA 0.0200 NA 0.0094
Upvotes: 1
Reputation: 43334
There are options, but you need to chop the character strings into the numeric part and the label part. One option which assumes you only have Mb
and Kb
(but could be extended) is using substr
:
df$size <- as.character(df$size)
df$size <- ifelse(substr(df$size, nchar(df$size)-1, nchar(df$size)) == 'Mb',
as.numeric(substr(df$size, 1, nchar(df$size)-2)),
as.numeric(substr(df$size, 1, nchar(df$size)-2))/1000)
names(df)[2] <- 'size_Mb'
df
# filename size_Mb
# 1 A 0.10
# 2 B 0.20
# 3 C 0.03
# 4 D 1.00
# 5 E 10.00
tidyr::separate
can also replace all the substr
by separating size
into two columns, simplifying the code a lot:
library(tidyr)
df <- separate(df, size, c('size', 'label'), sep = -3, convert = TRUE)
df$size_Mb <- ifelse(df$label == 'Mb', df$size, df$size/1000)
df <- df[,c('filename', 'size_Mb')]
If you combine tidyr
and dplyr
, you can do it all in one chain, if you like:
library(tidyr)
library(dplyr)
df %>% separate(size, c('size', 'label'), sep = -3, convert = TRUE) %>%
mutate(size = ifelse(label == 'Mb', size, size/1000)) %>%
select(-label)
If you'd rather, you can separate with regex with something like
df$label <- sub('\\d+(\\w+)', '\\1', df$size)
df$size <- as.numeric(sub('(\\d+)\\w+', '\\1', df$size))
Upvotes: 2