Anagha

Reputation: 3699

Averaging the value with respect to time

I have the dataset below, with a date-time column and a corresponding value recorded every 10 minutes. I need to generate new rows at 15-minute intervals.

For example, the value at 15:40 is 599 and the value at 15:50 is 594, so a new row is needed between the two: 15:45, with the average of 599 and 594, which is 596.5.

In other words, I need the average of the values at :10 and :20 to get the value for, say, 16:15, and the average of the values at :40 and :50 to get the value for 16:45. The values at :00 and :30 remain the same.

Date...Time       RA.CO2
6/15/2017 15:40   599
6/15/2017 15:50   594
6/15/2017 16:00   606
6/15/2017 16:10   594
6/15/2017 16:20   594
6/15/2017 16:30   594
6/15/2017 16:40   594
6/15/2017 16:50   594
6/16/2017 0:00    496.25
6/16/2017 0:10    500
6/16/2017 0:20    496.25
6/16/2017 0:30    496.25
6/16/2017 0:40    600
6/16/2017 0:50    650
6/16/2017 1:00    700 

str(df)
'data.frame':   6092 obs. of  2 variables:
 $ Date...Time: chr  "6/15/2017 15:40" "6/15/2017 15:50" "6/15/2017 16:00" "6/15/2017 16:10" ...
 $ RA.CO2     : num  599 594 606 594 594 594 594 594 594 594 ...
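
Because the column is stored as character, the timestamps need to be parsed before any time arithmetic is possible. A minimal sketch in base R, assuming the month/day/year layout shown above; the names raw, DT and gaps and the choice of UTC are illustrative assumptions, not part of the original data:

```r
# Sample strings copied from the question's data
raw <- c("6/15/2017 15:40", "6/15/2017 15:50", "6/16/2017 0:00")

# Parse month/day/year hour:minute; tz = "UTC" is an assumption made here
# so that the result does not depend on the local time zone
DT <- as.POSIXct(raw, format = "%m/%d/%Y %H:%M", tz = "UTC")

# Differences between consecutive timestamps, expressed in minutes
gaps <- as.numeric(diff(DT), units = "mins")

print(DT)
print(gaps)
```

The second gap spans the jump from 15:50 to midnight, which is the kind of break any averaging rule has to skip.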

Desired output

 Date...Time          RA.CO2
 6/15/2017 15:45      596.5
 6/15/2017 16:00      606
 6/15/2017 16:15      594
 6/15/2017 16:30      594
 6/15/2017 16:45      594
 6/16/2017 0:00       496.25
 6/16/2017 0:15       498.125
 6/16/2017 0:30       496.25
 6/16/2017 0:45       625
 6/16/2017 1:00       700
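
The averaging rule can be checked by hand against the input rows. A small sketch in base R, using values copied from the data above; the variable names are only illustrative:

```r
# Each new quarter-hour value is the mean of the two neighbouring 10-minute readings
mid_1545 <- mean(c(599, 594))      # 15:40 and 15:50 on 6/15
mid_0015 <- mean(c(500, 496.25))   # 0:10 and 0:20 on 6/16
mid_0045 <- mean(c(600, 650))      # 0:40 and 0:50 on 6/16

print(c(mid_1545, mid_0015, mid_0045))
```

These agree with the 15:45, 0:15 and 0:45 rows of the desired output (596.5, 498.125 and 625).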

Upvotes: 2

Views: 162

Answers (4)

G. Grothendieck

Reputation: 270055

Here are some solutions. Having re-read the question, I am assuming that a new intermediate time should be inserted only before a time that is 20 or 50 minutes past the hour, and only when the immediately preceding time (before any insertion) is exactly 10 minutes earlier. If that is not the intent of the question, then it, the vector of intermediate times, will need to be changed from what is shown.

1) zoo. Merge df with a data frame holding the intermediate times, it, and then run na.approx from the zoo package on the RA column to fill in the resulting NA values by linear interpolation:

library(zoo)

it <- with(df, DT[c(FALSE, diff(DT) == 10) & as.POSIXlt(DT)$min %in% c(20, 50)] - 5 * 60)
M <- merge(df, data.frame(DT = it), all = TRUE)
transform(M, RA = na.approx(RA))

giving:

                    DT     RA
1  2017-06-15 15:40:00 599.00
2  2017-06-15 15:45:00 596.50
3  2017-06-15 15:50:00 594.00
4  2017-06-15 16:00:00 606.00
5  2017-06-15 16:10:00 594.00
6  2017-06-15 16:15:00 594.00
7  2017-06-15 16:20:00 594.00
8  2017-06-15 16:30:00 594.00
9  2017-06-15 16:40:00 594.00
10 2017-06-15 16:45:00 594.00
11 2017-06-15 16:50:00 594.00
12 2017-06-16 00:00:00 496.25
13 2017-06-16 00:10:00 496.25
14 2017-06-16 00:15:00 496.25
15 2017-06-16 00:20:00 496.25
16 2017-06-16 00:30:00 496.25

1a) Note that if df were converted to a zoo object, i.e. z <- read.zoo(df, tz = ""), then this could be written more compactly, giving a zoo object as the result:

na.approx(merge(z, zoo(, it)))

2) approx. This one uses no packages. The vector it is the one defined in (1).

with(df, data.frame(approx(DT, RA, xout = sort(c(DT, it)))))

giving:

                     x      y
1  2017-06-15 15:40:00 599.00
2  2017-06-15 15:45:00 596.50
3  2017-06-15 15:50:00 594.00
4  2017-06-15 16:00:00 606.00
5  2017-06-15 16:10:00 594.00
6  2017-06-15 16:15:00 594.00
7  2017-06-15 16:20:00 594.00
8  2017-06-15 16:30:00 594.00
9  2017-06-15 16:40:00 594.00
10 2017-06-15 16:45:00 594.00
11 2017-06-15 16:50:00 594.00
12 2017-06-16 00:00:00 496.25
13 2017-06-16 00:10:00 496.25
14 2017-06-16 00:15:00 496.25
15 2017-06-16 00:20:00 496.25
16 2017-06-16 00:30:00 496.25

Note: The input used for the above is:

df <- structure(list(DT = structure(c(1497555600, 1497556200, 1497556800, 
1497557400, 1497558000, 1497558600, 1497559200, 1497559800, 1497585600, 
1497586200, 1497586800, 1497587400), class = c("POSIXct", "POSIXt"
)), RA = c(599, 594, 606, 594, 594, 594, 594, 594, 496.25, 496.25, 
496.25, 496.25)), .Names = c("DT", "RA"), row.names = c(NA, -12L
), class = "data.frame")

Update: I have revised the assumption about which intermediate times to include.
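
The results above keep the original 10-minute rows alongside the inserted ones. If only the quarter-hour rows from the question's desired output are wanted, one option is to filter on the minute after interpolating. The following is a sketch under stated assumptions, not part of the solutions above: it rebuilds a sample df from the question's values (so 0:10 is 500 rather than 496.25), fixes the time zone to UTC, and uses the hypothetical names gap, is_target, res and quarter.

```r
# Sample rows copied from the question; tz = "UTC" is an assumption for reproducibility
df <- data.frame(
  DT = as.POSIXct(c("2017-06-15 15:40", "2017-06-15 15:50", "2017-06-15 16:00",
                    "2017-06-15 16:10", "2017-06-15 16:20", "2017-06-15 16:30",
                    "2017-06-15 16:40", "2017-06-15 16:50", "2017-06-16 00:00",
                    "2017-06-16 00:10", "2017-06-16 00:20", "2017-06-16 00:30"),
                  tz = "UTC"),
  RA = c(599, 594, 606, 594, 594, 594, 594, 594, 496.25, 500, 496.25, 496.25)
)

# Gap to the previous row in minutes, with explicit units
gap <- c(NA, as.numeric(diff(df$DT), units = "mins"))

# Intermediate times: 5 minutes before each :20 or :50 row that follows a 10-minute gap
is_target <- !is.na(gap) & gap == 10 & as.POSIXlt(df$DT)$min %in% c(20, 50)
it <- df$DT[is_target] - 5 * 60

# Interpolate linearly at the original and intermediate times
xout <- sort(c(df$DT, it))
attr(xout, "tzone") <- "UTC"  # c() dropped the time zone before R 4.1
res <- data.frame(DT = xout,
                  RA = approx(as.numeric(df$DT), df$RA, xout = as.numeric(xout))$y)

# Keep only the rows that fall on a quarter hour
quarter <- res[as.POSIXlt(res$DT)$min %% 15 == 0, ]
rownames(quarter) <- NULL
print(quarter)
```

With these sample rows the filter leaves eight rows, from 15:45 through 0:30, whose values match the corresponding rows of the desired output.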

Upvotes: 2

www

Reputation: 39174

We can use tidyr to expand the data frame and imputeTS to impute the missing values by linear interpolation.

library(dplyr)
library(tidyr)
library(lubridate)
library(imputeTS)

dt2 <- dt %>%
  mutate(Date...Time = mdy_hm(Date...Time)) %>%
  mutate(Date = as.Date(Date...Time)) %>%
  group_by(Date) %>%
  complete(Date...Time = seq(min(Date...Time), max(Date...Time), by = "5 min")) %>%
  mutate(RA.CO2 = na_interpolation(RA.CO2)) %>%  # named na.interpolation() before imputeTS 3.0
  ungroup() %>%
  select(Date...Time, RA.CO2)
dt2
# A tibble: 22 x 2
           Date...Time RA.CO2
                <dttm>  <dbl>
 1 2017-06-15 15:40:00  599.0
 2 2017-06-15 15:45:00  596.5
 3 2017-06-15 15:50:00  594.0
 4 2017-06-15 15:55:00  600.0
 5 2017-06-15 16:00:00  606.0
 6 2017-06-15 16:05:00  600.0
 7 2017-06-15 16:10:00  594.0
 8 2017-06-15 16:15:00  594.0
 9 2017-06-15 16:20:00  594.0
10 2017-06-15 16:25:00  594.0
# ... with 12 more rows

My output is not entirely the same as your desired output, because several points are unclear:

  1. It is not clear how you get the value for 6/16/2017 0:10.
  2. Why is the interval sometimes 5 minutes and sometimes 10 minutes?
  3. Why do you include the last three rows? The rules for filling in their values are also unclear.

Nevertheless, I think my solution shows one possible way to achieve this task. You may need to adjust the code yourself to fit those unclear rules.

Data

dt <- read.table(text = "Date...Time       RA.CO2
                 '6/15/2017 15:40'   599
                 '6/15/2017 15:50'   594
                 '6/15/2017 16:00'   606
                 '6/15/2017 16:10'   594
                 '6/15/2017 16:20'   594
                 '6/15/2017 16:30'   594
                 '6/15/2017 16:40'   594
                 '6/15/2017 16:50'   594
                 '6/16/2017 0:00'    496.25
                 '6/16/2017 0:10'    496.25
                 '6/16/2017 0:20'    496.25
                 '6/16/2017 0:30'    496.25",
                 header = TRUE, stringsAsFactors = FALSE)
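
The same expand-then-interpolate idea can also be sketched without packages: split the rows by calendar day, build a regular 5-minute grid within each day, and interpolate onto it with approx. This is only an illustrative base R analogue of the pipeline above, not the code that produced the tibble; the sample rows, the UTC time zone and the names by_day, grid and res are assumptions.

```r
# Four sample rows spanning the overnight gap; values taken from the question
dt <- data.frame(
  DT = as.POSIXct(c("2017-06-15 16:40", "2017-06-15 16:50",
                    "2017-06-16 00:00", "2017-06-16 00:10"), tz = "UTC"),
  RA = c(594, 594, 496.25, 500)
)

# Split by calendar day so that nothing is interpolated across the overnight gap
by_day <- split(dt, format(dt$DT, "%Y-%m-%d"))

# Within each day, build a 5-minute grid and interpolate linearly onto it
grid <- lapply(by_day, function(d) {
  xout <- seq(min(d$DT), max(d$DT), by = "5 min")
  data.frame(DT = xout,
             RA = approx(as.numeric(d$DT), d$RA, xout = as.numeric(xout))$y)
})

res <- do.call(rbind, grid)
rownames(res) <- NULL
print(res)
```

Grouping by day is what keeps rows from being generated between 16:50 and 0:00.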

Upvotes: 3

Sotos

Reputation: 51582

Here is a different idea using the zoo library:

library(zoo)

df1 <- df[rep(rownames(df), each = 2),]
df1$DateTime[c(FALSE, TRUE)] <- df1$DateTime[c(FALSE, TRUE)]+5*60
df1$RA.CO2[c(FALSE, TRUE)] <- rollapply(df$RA.CO2, 2, by = 2, mean)

which gives,

               DateTime RA.CO2
1    2017-06-15 15:40:00 599.00
1.1  2017-06-15 15:45:00 596.50
2    2017-06-15 15:50:00 594.00
2.1  2017-06-15 15:55:00 600.00
3    2017-06-15 16:00:00 606.00
3.1  2017-06-15 16:05:00 594.00
4    2017-06-15 16:10:00 594.00
4.1  2017-06-15 16:15:00 594.00
5    2017-06-15 16:20:00 594.00
5.1  2017-06-15 16:25:00 496.25
6    2017-06-15 16:30:00 594.00
6.1  2017-06-15 16:35:00 496.25
7    2017-06-15 16:40:00 594.00
7.1  2017-06-15 16:45:00 596.50
8    2017-06-15 16:50:00 594.00
8.1  2017-06-15 16:55:00 600.00
9    2017-06-16 00:00:00 496.25
9.1  2017-06-16 00:05:00 594.00
10   2017-06-16 00:10:00 496.25
10.1 2017-06-16 00:15:00 594.00
11   2017-06-16 00:20:00 496.25
11.1 2017-06-16 00:25:00 496.25
12   2017-06-16 00:30:00 496.25
12.1 2017-06-16 00:35:00 496.25
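
The interleaved rows need one midpoint for every consecutive pair of readings. A minimal base R sketch of that calculation, using a hypothetical vector x as a stand-in for the first four RA.CO2 values; it pairs neighbours purely by position, so gaps between days would need separate handling:

```r
# Hypothetical stand-in for the first four values of the RA.CO2 column
x <- c(599, 594, 606, 594)

# Mean of each value and its successor: one midpoint per consecutive pair
mid <- (head(x, -1) + tail(x, -1)) / 2

# Interleave originals and midpoints; the last value has no successor,
# so it is padded with NA and the padding is dropped afterwards
out <- head(c(rbind(x, c(mid, NA))), -1)
print(out)
```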

Upvotes: 1

Z.Lin

Reputation: 29095

Here's a solution using dplyr:

library(dplyr)

df %>%
  # calculate interpolated value between each row & next row
  mutate(DT.next = lead(DT),
         RA.next = lead(RA)) %>%
  mutate(diff = difftime(DT.next, DT, units = "mins")) %>% # fix the units so the test below is in minutes
  filter(as.numeric(diff) == 10) %>% # keep only 10 min intervals
  mutate(DT.interpolate = DT + diff/2,
         RA.interpolate = (RA + RA.next) / 2) %>%
  # bind to original dataframe & sort by date
  select(DT.interpolate, RA.interpolate) %>%
  rename(DT = DT.interpolate, RA = RA.interpolate) %>%
  rbind(df) %>%
  arrange(DT)

                    DT     RA
1  2017-06-15 15:40:00 599.00
2  2017-06-15 15:45:00 596.50
3  2017-06-15 15:50:00 594.00
4  2017-06-15 15:55:00 600.00
5  2017-06-15 16:00:00 606.00
6  2017-06-15 16:05:00 600.00
7  2017-06-15 16:10:00 594.00
8  2017-06-15 16:15:00 594.00
9  2017-06-15 16:20:00 594.00
10 2017-06-15 16:25:00 594.00
11 2017-06-15 16:30:00 594.00
12 2017-06-15 16:35:00 594.00
13 2017-06-15 16:40:00 594.00
14 2017-06-15 16:45:00 594.00
15 2017-06-15 16:50:00 594.00
16 2017-06-16 00:00:00 496.25
17 2017-06-16 00:05:00 496.25
18 2017-06-16 00:10:00 496.25
19 2017-06-16 00:15:00 496.25
20 2017-06-16 00:20:00 496.25
21 2017-06-16 00:25:00 496.25
22 2017-06-16 00:30:00 496.25

Dataset:

df <- data.frame(
  DT = c(seq(from = as.POSIXct("2017-06-15 15:40"),
             to = as.POSIXct("2017-06-15 16:50"),
             by = "10 min"),
         seq(from = as.POSIXct("2017-06-16 00:00"),
             to = as.POSIXct("2017-06-16 00:30"),
             by = "10 min")),
  RA = c(599, 594, 606, rep(594, 5), rep(496.25, 4))
)
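
A comparison such as diff == 10 depends on the units R assigns to a time difference, and by default those units are chosen from the data. A small sketch with hypothetical timestamps (the vectors even and odd are made up for illustration) shows the effect and one way to pin the units:

```r
# Two hypothetical series: one with only 10-minute steps, one with a 30-second step
even <- as.POSIXct(c("2017-06-15 15:40:00", "2017-06-15 15:50:00"), tz = "UTC")
odd  <- as.POSIXct(c("2017-06-15 15:40:00", "2017-06-15 15:40:30",
                     "2017-06-15 15:50:30"), tz = "UTC")

# Units are picked automatically from the smallest difference in each series
units_even <- units(diff(even))
units_odd  <- units(diff(odd))
print(c(units_even, units_odd))

# Asking for minutes explicitly gives comparable numbers in both cases
mins_odd <- as.numeric(diff(odd), units = "mins")
print(mins_odd)
```

On regular 10-minute data the automatic choice is minutes, so the comparison works; a single sub-minute gap anywhere in the column would switch the whole result to seconds.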

Upvotes: 1
