James Schwartz
James Schwartz

Reputation: 129

Is there a faster way to convert a column of dates to a column of characters for LARGE data sets? (300 million entires)

I currently have a date column within a data.table table, which uses the data.table package. The way I convert a column of dates (format is "Date" format) to character would be using the following function:

 data[,a:= as.character(a)]

Here, we assume that a is a column of dates in the data table "data", which is has 300 million entires in the column a.

However, this takes over 50 GB of RAM, computing power that doesn't come cheap. Is there a better workaround or technique?

Upvotes: 0

Views: 115

Answers (2)

nisetama
nisetama

Reputation: 8903

This was the fastest option in my benchmark:

> v=as.Date(sample(18000:19000,1e6,T))
> u=unique(v);as.character(u)[match(v,u)]|>str()
 chr [1:1000000] "2021-05-06" "2021-01-17" "2021-08-27" "2020-04-03" ...

Benchmark:

v=as.Date(sample(18000:19000,1e6,T))

b=microbenchmark(times=10,
  as.character(seq(min(v),max(v),1))[as.numeric(v-min(v)+1)],
  as.character(v),
  {u=unique(v);as.character(u)[match(v,u)]},
  data.table(x=v)[,y:=as.character(x),by=x]$y,
  {d=data.table(a=v);d[,a:=as.character(as.numeric(a))];d[,a:=rep(as.character(as.Date(as.numeric(a[1]), origin="1970-01-01")),length(a)),by=a]}
)

o=sort(tapply(b$time,b$expr,median))
writeLines(sprintf("%.2f %s",o/min(o),names(o)))

The output shows the median time of ten runs relative to the fastest method:

1.00 { u = unique(v) as.character(u)[match(v, u)] }
1.05 as.character(seq(min(v), max(v), 1))[as.numeric(v - min(v) + 1)]
2.16 data.table(x = v)[, `:=`(y, as.character(x)), by = x]$y
12.80 { d = data.table(a = v) d[, `:=`(a, as.character(as.numeric(a)))] d[, `:=`(a, rep(as.character(as.Date(as.numeric(a[1]), origin = "1970-01-01")), length(a))), by = a] }
31.77 as.character(v)

Upvotes: 0

Roland
Roland

Reputation: 132746

Test this:

dates <- seq(as.Date("2014-01-01"), as.Date("2014-02-01"), by="1 day")

system.time({
DT1 <- data.table(a=rep(dates, 1e4), b=rnorm(32e4))
DT1[, a := as.character(a)]
})
#   user  system elapsed 
#   3.03    0.00    3.02 

system.time({
DT2 <- data.table(a=rep(dates, 1e4), b=rnorm(32e4))
DT2[, a := as.character(as.numeric(a))]
DT2[, a := rep(as.character(as.Date(as.numeric(a[1]), origin="1970-01-01")), 
     length(a)), by=a]
})
#   user  system elapsed 
#   0.44    0.00    0.44

all.equal(DT1[,a], DT2[,a])
#[1] TRUE

Upvotes: 4

Related Questions