Reputation: 91
I have recently posted a similar question here, which was a bit oversimplified, though. Thus here we go again:
Assume I have a dataframe (dput output below) with time series data of many different variables (5 in this example, many more in the real data):
date a b c d e
1 2009-10-01 00:00:00 10 20 30 40 50
2 2009-10-01 01:00:00 11 21 31 41 51
3 2009-10-01 02:00:00 12 22 32 42 52
4 2009-10-01 03:00:00 13 23 33 43 53
5 2009-10-01 04:00:00 14 24 34 44 54
6 2009-10-01 05:00:00 15 25 35 45 55
7 2009-10-01 06:00:00 16 26 36 46 56
8 2009-10-01 07:00:00 17 27 37 47 57
9 2009-10-01 08:00:00 18 28 38 48 58
10 2009-10-01 09:00:00 19 29 39 49 59
11 2009-10-01 10:00:00 20 30 40 50 60
12 2009-10-01 11:00:00 21 31 41 51 61
13 2009-10-01 12:00:00 22 32 42 52 62
14 2009-10-01 13:00:00 23 33 43 53 63
15 2009-10-01 14:00:00 24 34 44 54 64
16 2009-10-01 15:00:00 25 35 45 55 65
17 2009-10-01 16:00:00 26 36 46 56 66
18 2009-10-01 17:00:00 27 37 47 57 67
19 2009-10-01 18:00:00 28 38 48 58 68
20 2009-10-01 19:00:00 29 39 49 59 69
21 2009-10-01 20:00:00 30 40 50 60 70
22 2009-10-01 21:00:00 31 41 51 61 71
23 2009-10-01 22:00:00 32 42 52 62 72
24 2009-10-01 23:00:00 33 43 53 63 73
25 2009-10-02 00:00:00 34 44 54 64 74
and another data frame "events" with different time periods defined by a start and stop date (3 here, many more in the real data):
id start stop
1 AGH 2009-10-01 02:00:00 2009-10-01 04:00:00
2 TRG 2009-10-01 03:00:00 2009-10-01 10:00:00
3 ZUH 2009-10-01 03:00:00 2009-10-01 20:00:00
I would like to get a table of the mean values of the variables within the different events like this:
id avg(y.a) avg(y.b) avg(y.c) avg(y.d) avg(y.e)
1 AGH 13.0 23.0 33.0 43.0 53.0
2 TRG 16.5 26.5 36.5 46.5 56.5
3 ZUH 21.5 31.5 41.5 51.5 61.5
I have learned from my previous post that I can do this using the sqldf package and a rather simple SQL statement:
means <- sqldf("
+ SELECT x.id, avg(y.a), avg(y.b), avg(y.c), avg(y.d), avg(y.e)
+ FROM events as x, data as y
+ WHERE y.date between x.start and x.stop
+ GROUP BY x.id
+ ")
However, as the real data contains many more columns to average, which are named differently in the various files I have to process, typing all the column names into the SQL statements becomes a bit tedious.
Thus I would prefer a solution in R, where I can simply refer to the columns by their number (data[2:100]) The difficulty is, though, that the time periods are non-continous and overlapping and the ids are character strings.
Any ideas how to do this would be much appreciated!
dput(data)
structure(list(date = structure(c(1254348000, 1254351600, 1254355200,
1254358800, 1254362400, 1254366000, 1254369600, 1254373200, 1254376800,
1254380400, 1254384000, 1254387600, 1254391200, 1254394800, 1254398400,
1254402000, 1254405600, 1254409200, 1254412800, 1254416400, 1254420000,
1254423600, 1254427200, 1254430800, 1254434400), class = c("POSIXct",
"POSIXt"), tzone = "Europe/Berlin"), a = 10:34, b = 20:44, c = 30:54,
d = 40:64, e = 50:74), .Names = c("date", "a", "b", "c",
"d", "e"), row.names = c(NA, -25L), class = "data.frame")
dput(events)
structure(list(id = structure(1:3, .Label = c("AGH", "TRG", "ZUH"
), class = "factor"), start = structure(c(1254355200, 1254358800,
1254358800), class = c("POSIXct", "POSIXt"), tzone = "Europe/Berlin"),
stop = structure(c(1254362400, 1254384000, 1254420000), class = c("POSIXct",
"POSIXt"), tzone = "Europe/Berlin")), .Names = c("id", "start",
"stop"), row.names = c(NA, -3L), class = "data.frame")
Upvotes: 3
Views: 1243
Reputation: 269596
The basic problem is due to the fact that the data is not normalized; however, short of putting it into long form we could dynamically generate the sql statement:
library(sqldf)
sql <- paste("select id, ",
toString(sprintf("avg(y.%s)", names(data)[-1])),
"from events as x, data as y
where y.date between x.start and x.stop
group by x.id")
sqldf(sql)
As an alternative, we show the use of melt
in the reshape2 package to convert the data to long form, data_long
, process it to give means.long
and convert it back to wide form using dcast
:
library(reshape2)
data_long <- melt(data, id.vars = "date")
means_long <- sqldf("
SELECT x.id, y.variable, avg(value)
FROM events as x, data_long as y
WHERE y.date between x.start and x.stop
GROUP BY x.id, y.variable
")
means <- dcast(id ~ variable, data = means_long, value.var = "avg(value)")
Upvotes: 2
Reputation: 263342
> t( sapply(events$id , function(id) lapply (
data[ data[["date"]] >= events[ events[['id']]==id, 'start'] &
data[["date"]] <= events[ events[['id']]==id, 'stop' ] , -1 ],
mean) ) )
a b c d e
[1,] 13 23 33 43 53
[2,] 16.5 26.5 36.5 46.5 56.5
[3,] 21.5 31.5 41.5 51.5 61.5
# Or if you prefer:
t( sapply(events$id , function(id) data.frame(
id=as.character(id),
lapply (data[ data[["date"]] >= events[events[['id']]==id, 'start'] &
data[["date"]] <= events[ events[['id']]==id, 'stop' ] , -1 ],
mean) ,stringsAsFactors=FALSE) ) )
id a b c d e
[1,] "AGH" 13 23 33 43 53
[2,] "TRG" 16.5 26.5 36.5 46.5 56.5
[3,] "ZUH" 21.5 31.5 41.5 51.5 61.5
Upvotes: 1