Reputation: 129
I currently have a date column within a data.table table, which uses the data.table package. The way I convert a column of dates (format is "Date" format) to character would be using the following function:
data[,a:= as.character(a)]
Here, we assume that a is a column of dates in the data table "data", which is has 300 million entires in the column a.
However, this takes over 50 GB of RAM, computing power that doesn't come cheap. Is there a better workaround or technique?
Upvotes: 0
Views: 115
Reputation: 8903
This was the fastest option in my benchmark:
> v=as.Date(sample(18000:19000,1e6,T))
> u=unique(v);as.character(u)[match(v,u)]|>str()
chr [1:1000000] "2021-05-06" "2021-01-17" "2021-08-27" "2020-04-03" ...
Benchmark:
v=as.Date(sample(18000:19000,1e6,T))
b=microbenchmark(times=10,
as.character(seq(min(v),max(v),1))[as.numeric(v-min(v)+1)],
as.character(v),
{u=unique(v);as.character(u)[match(v,u)]},
data.table(x=v)[,y:=as.character(x),by=x]$y,
{d=data.table(a=v);d[,a:=as.character(as.numeric(a))];d[,a:=rep(as.character(as.Date(as.numeric(a[1]), origin="1970-01-01")),length(a)),by=a]}
)
o=sort(tapply(b$time,b$expr,median))
writeLines(sprintf("%.2f %s",o/min(o),names(o)))
The output shows the median time of ten runs relative to the fastest method:
1.00 { u = unique(v) as.character(u)[match(v, u)] }
1.05 as.character(seq(min(v), max(v), 1))[as.numeric(v - min(v) + 1)]
2.16 data.table(x = v)[, `:=`(y, as.character(x)), by = x]$y
12.80 { d = data.table(a = v) d[, `:=`(a, as.character(as.numeric(a)))] d[, `:=`(a, rep(as.character(as.Date(as.numeric(a[1]), origin = "1970-01-01")), length(a))), by = a] }
31.77 as.character(v)
Upvotes: 0
Reputation: 132746
Test this:
dates <- seq(as.Date("2014-01-01"), as.Date("2014-02-01"), by="1 day")
system.time({
DT1 <- data.table(a=rep(dates, 1e4), b=rnorm(32e4))
DT1[, a := as.character(a)]
})
# user system elapsed
# 3.03 0.00 3.02
system.time({
DT2 <- data.table(a=rep(dates, 1e4), b=rnorm(32e4))
DT2[, a := as.character(as.numeric(a))]
DT2[, a := rep(as.character(as.Date(as.numeric(a[1]), origin="1970-01-01")),
length(a)), by=a]
})
# user system elapsed
# 0.44 0.00 0.44
all.equal(DT1[,a], DT2[,a])
#[1] TRUE
Upvotes: 4