cin
cin

Reputation: 91

R: How to average data within different time periods defined in a data frame?

assume I have a data frame "data" with time series data of a measured variable x:

     date            x
2009/10/01 00:00    10
2009/10/01 01:00    11
2009/10/01 02:00    12
2009/10/01 03:00    13
2009/10/01 04:00    14
2009/10/01 05:00    15
2009/10/01 06:00    16
2009/10/01 07:00    17
2009/10/01 08:00    18
2009/10/01 09:00    19
2009/10/01 10:00    20
2009/10/01 11:00    21
2009/10/01 12:00    22
2009/10/01 13:00    23
2009/10/01 14:00    24
2009/10/01 15:00    25
2009/10/01 16:00    26
2009/10/01 17:00    27
2009/10/01 18:00    28
2009/10/01 19:00    29
2009/10/01 20:00    30
2009/10/01 21:00    31
2009/10/01 22:00    32
2009/10/01 23:00    33
2009/10/02 00:00    34
...

and another data frame "events" with different time periods defined by a start and stop date:

id        start              stop
1   2009/10/01 02:00    2009/10/01 04:00
2   2009/10/01 07:00    2009/10/01 10:00
3   2009/10/01 08:00    2009/10/01 20:00
...

Now I would like to get a table of the mean values of x within the different events like this:

id  mean.x
1   13
2   18.5
3   25.5

In a database I do a simple SQL statement like this:

SELECT a.id, avg(b.x) 
FROM events as a, data as b 
WHERE b.date between a.start and a.stop 
GROUP BY a.id 

I wonder how I could do such averaging in R? I could use 'aggregate' if I had an id column in data indicating which data point falls within wich event, but I cannot find a way to create this column...

Any suggestions would be much appreciated.

cin

Edit:

dput(data):

structure(list(date = structure(c(1254348000, 1254351600, 1254355200, 
1254358800, 1254362400, 1254366000, 1254369600, 1254373200, 1254376800, 
1254380400, 1254384000, 1254387600, 1254391200, 1254394800, 1254398400, 
1254402000, 1254405600, 1254409200, 1254412800, 1254416400, 1254420000, 
1254423600, 1254427200, 1254430800, 1254434400), class = c("POSIXct", 
"POSIXt"), tzone = "Europe/Berlin"), x = 10:34), .Names = c("date", 
"x"), row.names = c(NA, -25L), class = "data.frame")

dput(events):

structure(list(id = 1:3, start = structure(c(1254355200, 1254373200, 
1254387600), class = c("POSIXct", "POSIXt"), tzone = "Europe/Berlin"), 
    stop = structure(c(1254362400, 1254384000, 1254420000), class = c("POSIXct", 
    "POSIXt"), tzone = "Europe/Berlin")), .Names = c("id", "start", 
"stop"), row.names = c(NA, -3L), class = "data.frame")

Edit2:

dput(events2):

structure(list(id = structure(1:3, .Label = c("AGH", "TRG", "ZUH"
), class = "factor"), start = structure(c(1254355200, 1254358800, 
1254358800), class = c("POSIXct", "POSIXt"), tzone = "Europe/Berlin"), 
    stop = structure(c(1254362400, 1254384000, 1254420000), class = c("POSIXct", 
    "POSIXt"), tzone = "Europe/Berlin")), .Names = c("id", "start", 
"stop"), row.names = c(NA, -3L), class = "data.frame")

Upvotes: 2

Views: 2514

Answers (2)

G. Grothendieck
G. Grothendieck

Reputation: 269501

Try this:

library(sqldf)
sqldf("
    SELECT a.id, avg(b.x) 
    FROM events as a, data as b 
    WHERE b.date between a.start and a.stop 
    GROUP BY a.id 
")

Upvotes: 4

David Robinson
David Robinson

Reputation: 78590

You can get a list of what event each date falls into like this:

event.indices = sapply(data$date, function(d) which(d >= events$start & d <= events$stop))

The problem is that some of them will be null if they don't fall into any event. So you replace those with NA's and you're on your way:

event.indices = sapply(event.indices, function(i) ifelse(is.null(i), NA, i))

On your data, this is:

  [1] NA NA  1  1  1 NA NA  2  2  2  2  3  3  3  3  3  3  3  3  3  3 NA NA NA NA

You can then add that to your original data frame, and use aggregate just as you originally suggested.

NOTE: This is assuming that each date is guaranteed to fall into at most one event. If the events overlap (even if one begins precisely when the last one ends), you'll have to decide what to do with dates that fall under multiple events and adjust appropriately.

Upvotes: 1

Related Questions