Stephen Paulger
Stephen Paulger

Reputation: 5343

Reading durations

I have a CSV file containing times per competitor of each section of a triathlon. I am having trouble reading the data so that R can use it. Here is an example of how the data looks (I've removed some columns for clarity):

"Place","Division","Gender","Swim","T1","Bike","T2","Run","Finish"
1, "40-49","M","7:45","0:55","27:07","0:29","18:53","55:07"
2, "UNDER 18","M","5:41","0:28","30:41","0:28","18:38","55:55"
3, "40-49","M","6:27","0:26","29:24","0:40","20:16","57:11"
4, "40-49","M","7:57","0:35","29:19","0:23","19:20","57:32"
5, "40-49","M","6:28","0:32","31:00","0:34","19:19","57:51"
6, "40-49","M","7:42","0:30","30:02","0:37","19:11","58:02"
....
250 ,"18-29","F","13:20","3:23","1:06:40","1:19","38:00","2:02:40"
251 ,"30-39","F","13:01","2:42","1:02:12","1:20","43:45","2:02:58"
252 ,50 ,"F","20:45","1:33","58:09","3:17","40:14","2:03:56"
253 ,"30-39","M","13:14","1:14","DNF","1:11","25:10","DNF bike"
254 ,"40-49","M","10:04","1:41","56:36","2:32",,"D.N.F"

My first naive attempt to plot the data went like this.

> tri <- read.csv(file.choose(), header=TRUE, as.is=TRUE)
> pairs(~ Bike + Run + Swim, data=tri)

The times are not being imported in a sensible way so the charts don't make sense.

I have found the difftime type and have tried to use it to parse the times in the data file.

There are some rows with DNF or similar in place of times, I'm happy for rows with times that can't be parsed to be discarded. There are two formats for the times "%M:%S" and "%H:%M:%S"

I think I need to create a new data frame from the data but I am having trouble parsing the times. This is what I have so far.

> tri <- read.csv(file.choose(), header=TRUE, as.is=TRUE)
> str(tri)
'data.frame':   254 obs. of  12 variables:
 $ Place     : num  1 2 3 4 5 6 7 8 9 10 ...
 $ Race..    : num  237 274 268 226 267 247 264 257 273 272 ...
 $ First.Name: chr  ** removed names ** ...
 $ Last.Name : chr  ** removed names ** ...
 $ Division  : chr  "40-49" "UNDER 18" "40-49" "40-49" ...
 $ Gender    : chr  "M" "M" "M" "M" ...
 $ Swim      : chr  "7:45" "5:41" "6:27" "7:57" ...
 $ T1        : chr  "0:55" "0:28" "0:26" "0:35" ...
 $ Bike      : chr  "27:07" "30:41" "29:24" "29:19" ...
 $ T2        : chr  "0:29" "0:28" "0:40" "0:23" ...
 $ Run       : chr  "18:53" "18:38" "20:16" "19:20" ...
 $ Finish    : chr  "55:07" "55:55" "57:11" "57:32" ...
> as.numeric(as.difftime(tri$Bike, format="%M:%S"), units="secs")

This converts all the times that are under one hour, but the hours are interpreted as minutes for any times over an hour. Substituting "%H:%M:%S" for "%M:%S" parses times over an hour but produces NA otherwise. What is the best way to convert both types of times?

EDIT: Adding a simple example as requested.

> times <- c("27:07", "1:02:12", "DNF")
> as.numeric(as.difftime(times, format="%M:%S"), units="secs")
[1] 1627   62   NA
> as.numeric(as.difftime(times, format="%H:%M:%S"), units="secs")
[1]   NA 3732   NA

The output I would like would be 1627 3732 NA

Upvotes: 1

Views: 114

Answers (1)

Ben Bolker
Ben Bolker

Reputation: 226332

Here's a quick hack at a solution, although there may be a better one:

cdifftime <- function(x) {
   x2 <- gsub("^([0-9]+:[0-9]+)$","00:\\1",x)  ## prepend 00: to %M:%S elements
   res <- as.difftime(x2,format="%H:%M:%S")  
   units(res) <- "secs"
   as.numeric(res)
}
times <- c("27:07", "1:02:12", "DNF")
cdifftime(times)
## [1] 1627 3732   NA

You can apply this to the relevant columns:

tri[4:9] <- lapply(tri[4:9],cdifftime)

A couple of notes from trying to replicate your example:

  • you may want to use na.strings="DNF" to set "did not finish" values to NA automatically
  • you need to make sure strings are not read in as factors, e.g. (1) set options(stringsAsFactors="FALSE"); (2) use stringsAsFactors=FALSE when calling read.csv; (3) use as.is=TRUE, ditto.

Upvotes: 4

Related Questions