Reputation: 91
assume I have a data frame "data" with time series data of a measured variable x:
date x
2009/10/01 00:00 10
2009/10/01 01:00 11
2009/10/01 02:00 12
2009/10/01 03:00 13
2009/10/01 04:00 14
2009/10/01 05:00 15
2009/10/01 06:00 16
2009/10/01 07:00 17
2009/10/01 08:00 18
2009/10/01 09:00 19
2009/10/01 10:00 20
2009/10/01 11:00 21
2009/10/01 12:00 22
2009/10/01 13:00 23
2009/10/01 14:00 24
2009/10/01 15:00 25
2009/10/01 16:00 26
2009/10/01 17:00 27
2009/10/01 18:00 28
2009/10/01 19:00 29
2009/10/01 20:00 30
2009/10/01 21:00 31
2009/10/01 22:00 32
2009/10/01 23:00 33
2009/10/02 00:00 34
...
and another data frame "events" with different time periods defined by a start and stop date:
id start stop
1 2009/10/01 02:00 2009/10/01 04:00
2 2009/10/01 07:00 2009/10/01 10:00
3 2009/10/01 08:00 2009/10/01 20:00
...
Now I would like to get a table of the mean values of x within the different events like this:
id mean.x
1 13
2 18.5
3 25.5
In a database I do a simple SQL statement like this:
SELECT a.id, avg(b.x)
FROM events as a, data as b
WHERE b.date between a.start and a.stop
GROUP BY a.id
I wonder how I could do such averaging in R? I could use 'aggregate' if I had an id column in data indicating which data point falls within wich event, but I cannot find a way to create this column...
Any suggestions would be much appreciated.
cin
Edit:
dput(data):
structure(list(date = structure(c(1254348000, 1254351600, 1254355200,
1254358800, 1254362400, 1254366000, 1254369600, 1254373200, 1254376800,
1254380400, 1254384000, 1254387600, 1254391200, 1254394800, 1254398400,
1254402000, 1254405600, 1254409200, 1254412800, 1254416400, 1254420000,
1254423600, 1254427200, 1254430800, 1254434400), class = c("POSIXct",
"POSIXt"), tzone = "Europe/Berlin"), x = 10:34), .Names = c("date",
"x"), row.names = c(NA, -25L), class = "data.frame")
dput(events):
structure(list(id = 1:3, start = structure(c(1254355200, 1254373200,
1254387600), class = c("POSIXct", "POSIXt"), tzone = "Europe/Berlin"),
stop = structure(c(1254362400, 1254384000, 1254420000), class = c("POSIXct",
"POSIXt"), tzone = "Europe/Berlin")), .Names = c("id", "start",
"stop"), row.names = c(NA, -3L), class = "data.frame")
Edit2:
dput(events2):
structure(list(id = structure(1:3, .Label = c("AGH", "TRG", "ZUH"
), class = "factor"), start = structure(c(1254355200, 1254358800,
1254358800), class = c("POSIXct", "POSIXt"), tzone = "Europe/Berlin"),
stop = structure(c(1254362400, 1254384000, 1254420000), class = c("POSIXct",
"POSIXt"), tzone = "Europe/Berlin")), .Names = c("id", "start",
"stop"), row.names = c(NA, -3L), class = "data.frame")
Upvotes: 2
Views: 2514
Reputation: 269501
Try this:
library(sqldf)
sqldf("
SELECT a.id, avg(b.x)
FROM events as a, data as b
WHERE b.date between a.start and a.stop
GROUP BY a.id
")
Upvotes: 4
Reputation: 78590
You can get a list of what event each date falls into like this:
event.indices = sapply(data$date, function(d) which(d >= events$start & d <= events$stop))
The problem is that some of them will be null if they don't fall into any event. So you replace those with NA's and you're on your way:
event.indices = sapply(event.indices, function(i) ifelse(is.null(i), NA, i))
On your data, this is:
[1] NA NA 1 1 1 NA NA 2 2 2 2 3 3 3 3 3 3 3 3 3 3 NA NA NA NA
You can then add that to your original data frame, and use aggregate just as you originally suggested.
NOTE: This is assuming that each date is guaranteed to fall into at most one event. If the events overlap (even if one begins precisely when the last one ends), you'll have to decide what to do with dates that fall under multiple events and adjust appropriately.
Upvotes: 1