Adam Ryczkowski
Adam Ryczkowski

Reputation: 8064

What is a fast way to convert a string date to a number?

I have a lot of large csv files (several MB), with a lot of numbers and times, where all values are stored without quotation marks, and separated by semicolons. Numbers are encoded with colons for decimal separators. Times are encoded as hh:mm:ss. Many values are missing, and they are encoded as string with size zero (not "NA").

I need to convert strings representing numbers to numbers, and times to some numerical representation, that would allow easy arithmetic operations. AFAIK the default date/time representation in R is quite bulky, the POSIXct object on my system is 312 bytes long.

I use a piece of code like this:

library(chron) #For chron function
library(plyr)

string2time<-function(f)
{
  if(is.na(f))
  {
    return(NA)
  }
  if(f==''){
    return(NA)
  } 
  n<-as.numeric(f)
  if(is.na(n))
  {
    ans<-n
    try(ans<-chron(times.=f),
        silent=TRUE)
    return(as.numeric(ans))
  }
  return(n)
}

interpreteCSV<-function(filename,parallel)
{
   data<-read.csv2(filename)
   data.fix<-t(laply(data,function(x) {aaply(as.character(x),1,string2time)},.parallel=parallel))
   return(data.fix)
}

The problem is that the function is seems very inefficient.

I know, that the function looks awkward, but it is a that it is the only that works with the missing values. Maybe I could speed it up a little, if I wrote two functions: one for numeric columns, and the other for times columns. But something tells me, that I'm missing here something fundamental here, and that's why I ask you for some inspiration.

Here is a profiling excerpt:

> ?Rprof
> summaryRprof(filename="/tmp/temp/rprof.out")
$by.self
                           self.time self.pct total.time total.pct
".Call"                         0.30    93.75       0.30     93.75
"getIterVal.containeriter"      0.02     6.25       0.02      6.25

$by.total
                           total.time total.pct self.time self.pct
"<Anonymous>"                    0.32    100.00      0.00     0.00
"a_ply"                          0.32    100.00      0.00     0.00
"%dopar%"                        0.32    100.00      0.00     0.00
"l_ply"                          0.32    100.00      0.00     0.00
"prepare.rds.from.csv"           0.32    100.00      0.00     0.00
".Call"                          0.30     93.75      0.30    93.75
"mclapply"                       0.30     93.75      0.00     0.00
"FUN"                            0.16     50.00      0.00     0.00
"lapply"                         0.16     50.00      0.00     0.00
"selectChildren"                 0.16     50.00      0.00     0.00

A sample line of csv looks like that (excluding the header). There are several thousand such lines in the file:

1;XAD-01;m;wn;18,00;6;7;6;60;;;;15:10:11.8;;2,73;;16,56;;15:10:19.4;;15:10:11.8;;15:10:23.9;;15:10:19.4;;15:10:28.6;;0;;0;;0;;0;;15:10:06.0;;59,70;;15:10:36.6;;-1,47;;617;;0;;3,65;;29,96;;15:10:33.4;;15,97;;15:10:33.4;;4,43;5;12;4;30;8;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

Upvotes: 0

Views: 442

Answers (1)

IRTFM
IRTFM

Reputation: 263411

Let's assume you know which column numbers need conversion. Further assume this set of columns is in a vector named cols.to,convert:

cols.to.convert <-  grep( "\\d{1,2}\\:\\d{1,2}\\:\\d{1,2}", dfrm[1,])
cols.to.convert
# [1] 13 19 21 23 25 27 37 41 53 57

dfrm[cols.to.convert]  <- lapply(dfrm[cols.to.convert], as.POSIXct, format="%H:%M:%S")

# all now POSIXct class.
lapply(dfrm[cols.to.convert], class)

Note that an R POSIXct vector will have both a date and a time and by default converting from a time-only format will return a date-time with today's date:

> dfrm$V57
[1] "2014-01-12 15:10:33 PST"

Then for the NA conversion this should be fairly efficient:

  is.na(dfrm) <- dfrm = =""

Upvotes: 1

Related Questions