Reputation: 91

R: how to average within overlapping time periods

I have recently posted a similar question here, which was a bit oversimplified, though. Thus here we go again:

Assume I have a dataframe (dput output below) with time series data of many different variables (5 in this example, many more in the real data):

          date          a  b  c  d  e
1  2009-10-01 00:00:00 10 20 30 40 50
2  2009-10-01 01:00:00 11 21 31 41 51
3  2009-10-01 02:00:00 12 22 32 42 52
4  2009-10-01 03:00:00 13 23 33 43 53
5  2009-10-01 04:00:00 14 24 34 44 54
6  2009-10-01 05:00:00 15 25 35 45 55
7  2009-10-01 06:00:00 16 26 36 46 56
8  2009-10-01 07:00:00 17 27 37 47 57
9  2009-10-01 08:00:00 18 28 38 48 58
10 2009-10-01 09:00:00 19 29 39 49 59
11 2009-10-01 10:00:00 20 30 40 50 60
12 2009-10-01 11:00:00 21 31 41 51 61
13 2009-10-01 12:00:00 22 32 42 52 62
14 2009-10-01 13:00:00 23 33 43 53 63
15 2009-10-01 14:00:00 24 34 44 54 64
16 2009-10-01 15:00:00 25 35 45 55 65
17 2009-10-01 16:00:00 26 36 46 56 66
18 2009-10-01 17:00:00 27 37 47 57 67
19 2009-10-01 18:00:00 28 38 48 58 68
20 2009-10-01 19:00:00 29 39 49 59 69
21 2009-10-01 20:00:00 30 40 50 60 70
22 2009-10-01 21:00:00 31 41 51 61 71
23 2009-10-01 22:00:00 32 42 52 62 72
24 2009-10-01 23:00:00 33 43 53 63 73
25 2009-10-02 00:00:00 34 44 54 64 74

and another data frame "events" with different time periods defined by a start and stop date (3 here, many more in the real data):

   id       start                stop
1 AGH 2009-10-01 02:00:00 2009-10-01 04:00:00
2 TRG 2009-10-01 03:00:00 2009-10-01 10:00:00
3 ZUH 2009-10-01 03:00:00 2009-10-01 20:00:00

I would like to get a table of the mean values of the variables within the different events like this:

   id avg(y.a) avg(y.b) avg(y.c) avg(y.d) avg(y.e)
1 AGH     13.0     23.0     33.0     43.0     53.0
2 TRG     16.5     26.5     36.5     46.5     56.5
3 ZUH     21.5     31.5     41.5     51.5     61.5

I have learned from my previous post that I can do this using the sqldf package and a rather simple SQL statement:

means <- sqldf("
+     SELECT x.id, avg(y.a), avg(y.b), avg(y.c), avg(y.d), avg(y.e) 
+     FROM events as x, data as y 
+     WHERE y.date between x.start and x.stop 
+     GROUP BY x.id 
+ ")

However, as the real data contains many more columns to average, which are named differently in the various files I have to process, typing all the column names into the SQL statements becomes a bit tedious.

Thus I would prefer a solution in R, where I can simply refer to the columns by their number (data[2:100]) The difficulty is, though, that the time periods are non-continous and overlapping and the ids are character strings.

Any ideas how to do this would be much appreciated!

dput(data)

structure(list(date = structure(c(1254348000, 1254351600, 1254355200, 
1254358800, 1254362400, 1254366000, 1254369600, 1254373200, 1254376800, 
1254380400, 1254384000, 1254387600, 1254391200, 1254394800, 1254398400, 
1254402000, 1254405600, 1254409200, 1254412800, 1254416400, 1254420000, 
1254423600, 1254427200, 1254430800, 1254434400), class = c("POSIXct", 
"POSIXt"), tzone = "Europe/Berlin"), a = 10:34, b = 20:44, c = 30:54, 
    d = 40:64, e = 50:74), .Names = c("date", "a", "b", "c", 
"d", "e"), row.names = c(NA, -25L), class = "data.frame")

dput(events)

structure(list(id = structure(1:3, .Label = c("AGH", "TRG", "ZUH"
), class = "factor"), start = structure(c(1254355200, 1254358800, 
1254358800), class = c("POSIXct", "POSIXt"), tzone = "Europe/Berlin"), 
    stop = structure(c(1254362400, 1254384000, 1254420000), class = c("POSIXct", 
    "POSIXt"), tzone = "Europe/Berlin")), .Names = c("id", "start", 
"stop"), row.names = c(NA, -3L), class = "data.frame")

Upvotes: 3

Answers (2)

G. Grothendieck

Reputation: 269596

The basic problem is due to the fact that the data is not normalized; however, short of putting it into long form we could dynamically generate the sql statement:

library(sqldf)
sql <- paste("select id, ", 
    toString(sprintf("avg(y.%s)", names(data)[-1])),
    "from events as x, data as y
    where y.date between x.start and x.stop
    group by x.id")
sqldf(sql)

As an alternative, we show the use of melt in the reshape2 package to convert the data to long form, data_long, process it to give means.long and convert it back to wide form using dcast :

library(reshape2)
data_long <- melt(data, id.vars = "date")
means_long <- sqldf("
     SELECT x.id, y.variable, avg(value)
     FROM events as x, data_long as y 
     WHERE y.date between x.start and x.stop 
     GROUP BY x.id, y.variable
")
means <- dcast(id ~ variable, data = means_long, value.var = "avg(value)")

Upvotes: 2

IRTFM

Reputation: 263342

>  t( sapply(events$id , function(id) lapply (
            data[ data[["date"]] >= events[ events[['id']]==id, 'start'] & 
                  data[["date"]] <= events[ events[['id']]==id, 'stop' ] ,  -1 ], 
            mean) ) )
     a    b    c    d    e   
[1,] 13   23   33   43   53  
[2,] 16.5 26.5 36.5 46.5 56.5
[3,] 21.5 31.5 41.5 51.5 61.5
#  Or if you prefer:
t( sapply(events$id , function(id) data.frame( 
                       id=as.character(id), 
                       lapply (data[ data[["date"]] >= events[events[['id']]==id, 'start'] &  
                                     data[["date"]] <= events[ events[['id']]==id, 'stop' ] , -1 ],
                               mean) ,stringsAsFactors=FALSE) ) )
     id    a    b    c    d    e   
[1,] "AGH" 13   23   33   43   53  
[2,] "TRG" 16.5 26.5 36.5 46.5 56.5
[3,] "ZUH" 21.5 31.5 41.5 51.5 61.5

Upvotes: 1

R: how to average within overlapping time periods

Answers (2)

Related Questions