Reputation: 8064
I have a lot of large CSV files (several MB each) with many numbers and times, where all values are stored without quotation marks and separated by semicolons. Numbers use commas as decimal separators. Times are encoded as hh:mm:ss. Many values are missing, and they are encoded as zero-length strings (not "NA").
I need to convert strings representing numbers to numbers, and times to some numerical representation that allows easy arithmetic operations. AFAIK the default date/time representation in R is quite bulky; a POSIXct object on my system is 312 bytes long.
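For scale (my own check, not from the post), a plain numeric vector of seconds since midnight is much smaller than the equivalent POSIXct value, which carries class and timezone attributes:

```r
# Compare storage of a time-of-day as POSIXct vs plain numeric seconds.
t_posix <- as.POSIXct("15:10:11", format = "%H:%M:%S")
t_num   <- 15 * 3600 + 10 * 60 + 11   # seconds since midnight, a bare double

object.size(t_posix)  # larger: includes class and tzone attributes
object.size(t_num)    # smaller: a single unclassed double
```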
I use a piece of code like this:
library(chron) # for the chron() time parser
library(plyr)  # for laply() / aaply()

# Convert one string to a number, or to a numeric time via chron();
# NA and empty strings become NA.
string2time <- function(f)
{
  if (is.na(f)) {
    return(NA)
  }
  if (f == '') {
    return(NA)
  }
  n <- as.numeric(f)
  if (is.na(n)) {
    ans <- n
    try(ans <- chron(times. = f), silent = TRUE)
    return(as.numeric(ans))
  }
  return(n)
}
interpreteCSV <- function(filename, parallel)
{
  data <- read.csv2(filename)
  # Apply string2time to every cell, column by column
  data.fix <- t(laply(data,
                      function(x) aaply(as.character(x), 1, string2time),
                      .parallel = parallel))
  return(data.fix)
}
The problem is that the function seems very inefficient.
I know the function looks awkward, but it is the only approach I found that works with the missing values. Maybe I could speed it up a little if I wrote two functions: one for numeric columns and the other for time columns. But something tells me I'm missing something fundamental here, and that's why I'm asking for inspiration.
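The two-function idea can be sketched as one vectorized per-column converter (the function name and the layout assumptions are mine, not from the post). Since read.csv2 already parses comma-decimal numbers, only whole columns need dispatching, not individual cells:

```r
# Sketch: vectorized per-column conversion instead of one call per cell.
# Assumes time columns match "hh:mm:ss" (optionally with fractional seconds)
# and that empty strings mark missing values.
convert_column <- function(x) {
  x <- as.character(x)
  x[x == ""] <- NA
  if (any(grepl("^\\d{1,2}:\\d{2}:\\d{2}", x))) {
    # Time column: convert to seconds since midnight in one pass.
    p <- strsplit(x, ":", fixed = TRUE)
    vapply(p, function(q)
      if (length(q) == 3) sum(as.numeric(q) * c(3600, 60, 1)) else NA_real_,
      numeric(1))
  } else {
    # Numeric column: swap the decimal comma for a point.
    as.numeric(sub(",", ".", x, fixed = TRUE))
  }
}
```

Applied with `lapply(data, convert_column)`, this touches each column once instead of calling a scalar function per cell, which is usually where the time goes.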
Here is a profiling excerpt:
> summaryRprof(filename="/tmp/temp/rprof.out")
$by.self
                           self.time self.pct total.time total.pct
".Call"                         0.30    93.75       0.30     93.75
"getIterVal.containeriter"      0.02     6.25       0.02      6.25

$by.total
                           total.time total.pct self.time self.pct
"<Anonymous>"                    0.32    100.00      0.00     0.00
"a_ply"                          0.32    100.00      0.00     0.00
"%dopar%"                        0.32    100.00      0.00     0.00
"l_ply"                          0.32    100.00      0.00     0.00
"prepare.rds.from.csv"           0.32    100.00      0.00     0.00
".Call"                          0.30     93.75      0.30    93.75
"mclapply"                       0.30     93.75      0.00     0.00
"FUN"                            0.16     50.00      0.00     0.00
"lapply"                         0.16     50.00      0.00     0.00
"selectChildren"                 0.16     50.00      0.00     0.00
A sample line of the CSV looks like this (excluding the header); there are several thousand such lines in the file:
1;XAD-01;m;wn;18,00;6;7;6;60;;;;15:10:11.8;;2,73;;16,56;;15:10:19.4;;15:10:11.8;;15:10:23.9;;15:10:19.4;;15:10:28.6;;0;;0;;0;;0;;15:10:06.0;;59,70;;15:10:36.6;;-1,47;;617;;0;;3,65;;29,96;;15:10:33.4;;15,97;;15:10:33.4;;4,43;5;12;4;30;8;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
Upvotes: 0
Views: 442
Reputation: 263411
Let's assume you know which column numbers need conversion, and further assume this set of columns is in a vector named cols.to.convert:
cols.to.convert <- grep( "\\d{1,2}\\:\\d{1,2}\\:\\d{1,2}", dfrm[1,])
cols.to.convert
# [1] 13 19 21 23 25 27 37 41 53 57
dfrm[cols.to.convert] <- lapply(dfrm[cols.to.convert], as.POSIXct, format="%H:%M:%S")
# all now POSIXct class.
lapply(dfrm[cols.to.convert], class)
Note that an R POSIXct vector will have both a date and a time, and by default converting from a time-only format returns a date-time with today's date:
> dfrm$V57
[1] "2014-01-12 15:10:33 PST"
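Once converted, differences between POSIXct values are plain arithmetic (a minimal illustration of my own, not from the answer):

```r
# Subtracting POSIXct values yields a difftime; as.numeric() gives the number.
a <- as.POSIXct("15:10:19", format = "%H:%M:%S")
b <- as.POSIXct("15:10:11", format = "%H:%M:%S")
as.numeric(difftime(a, b, units = "secs"))   # 8
```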
Then for the NA conversion this should be fairly efficient:
is.na(dfrm) <- dfrm == ""
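A quick check of that idiom on toy data (my own example):

```r
# is.na<- with a logical matrix index marks every "" cell as NA in one pass.
dfrm <- data.frame(a = c("1", "", "3"), b = c("", "x", "y"),
                   stringsAsFactors = FALSE)
is.na(dfrm) <- dfrm == ""
dfrm$a   # "1" NA "3"
```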
Upvotes: 1