Dan Barowy
Dan Barowy

Reputation: 2270

How to select and plot hourly averages from data frame?

I have a CSV file that looks like this, where "time" is a UNIX timestamp:

time,count
1300162432,5
1299849832,0
1300006132,1
1300245532,4
1299932932,1
1300089232,1
1299776632,9
1299703432,14
... and so on

I am reading it into R and converting the time column into POSIXct like so:

data <- read.csv(file="data.csv",head=TRUE,sep=",")
data[,1] <- as.POSIXct(data[,1], origin="1970-01-01")

Great so far, but now I would like to build a histogram with each bin corresponding to the average hourly count. I'm stuck on selecting by hour and then counting. I've looked through ?POSIXt and ?cut.POSIXt, but if the answer is in there, I am not seeing it.

Any help would be appreciated.

Upvotes: 3

Views: 3268

Answers (3)

alicederyn
alicederyn

Reputation: 13247

There's a good post on this topic on Mages' blog. To get the bucketed data:

aggregate(. ~ cut(time, 'hours'), data, mean)

If you just want a quick graph, ggplot2 is your friend:

qplot(cut(time, "hours"), count, data=data, stat='summary', fun.y='mean')

Unfortunately, because cut returns a factor, the x axis won't work properly. You may want to write your own, less awkward bucketing function for time, e.g.

timebucket = function(x, bucketsize = 1,
                      units = c("secs", "mins",  "hours", "days", "weeks")) {
  secs = as.numeric(as.difftime(bucketsize, units=units[1]), units="secs")
  structure(floor(as.numeric(x) / secs) * secs, class=c('POSIXt','POSIXct'))
}
qplot(timebucket(time, units="hours"), ...)

Upvotes: 1

Dirk is no longer here
Dirk is no longer here

Reputation: 368241

Here is one way:

R> lines <- "time,count
1300162432,5
1299849832,0
1300006132,1
1300245532,4
1299932932,1
1300089232,1
1299776632,9
1299703432,14"
R> con <- textConnection(lines); df <- read.csv(con); close(con)
R> df$time <- as.POSIXct(df$time, origin="1970-01-01")
R> df$hour <- as.POSIXlt(df$time)$hour
R> df
                 time count hour
1 2011-03-15 05:13:52     5    5
2 2011-03-11 13:23:52     0   13
3 2011-03-13 09:48:52     1    9
4 2011-03-16 04:18:52     4    4
5 2011-03-12 12:28:52     1   12
6 2011-03-14 08:53:52     1    8
7 2011-03-10 17:03:52     9   17
8 2011-03-09 20:43:52    14   20
R> tapply(df$count, df$hour, FUN=mean)
 4  5  8  9 12 13 17 20 
 4  5  1  1  1  0  9 14 
R> 

Your data doesn't actually yet have multiple entries per hour-of-the-day but this would average over the hours, properly parsed from the POSIX time stamps. You can adjust with TZ info as needed.

Upvotes: 3

Daniel Dickison
Daniel Dickison

Reputation: 21882

You can calculate the hour "bin" for each time by converting to a POSIXlt and subtracting away the minute and seconds components. Then you can add a new column to your data frame that would contain the hour bin marker, like so:

date.to.hour <- function (vec)
{
    as.POSIXct(
        sapply(
            vec,
            function (x)
            {
                lt = as.POSIXlt(x)
                x - 60*lt$min - lt$sec
            }),
        tz="GMT",
        origin="1970-01-01")
}

data$hour <- date.to.hour(as.POSIXct(data[,1], origin="1970-01-01"))

Upvotes: 1

Related Questions