FlyingDutch
FlyingDutch

Reputation: 1152

dplyr mutate function to evaluate values within columns (current, previous, next) vertically

I have scoured SO for a way to achieve what I need without luck so here it goes. A while back I discovered the package dplyr and its potential. I am thinking this package can do what I want, I just don't know how. This is a small subset of my data, but should be representative of my problem.

    dummy<-structure(list(time = structure(1:20, .Label = c("2015-03-25 12:24:00", 
    "2015-03-25 21:08:00", "2015-03-25 21:13:00", "2015-03-25 21:47:00", 
    "2015-03-26 03:08:00", "2015-04-01 20:30:00", "2015-04-01 20:34:00", 
    "2015-04-01 20:42:00", "2015-04-01 20:45:00", "2015-09-29 18:26:00", 
    "2015-09-29 19:11:00", "2015-09-29 21:21:00", "2015-09-29 22:03:00", 
    "2015-09-29 22:38:00", "2015-09-30 00:48:00", "2015-09-30 01:38:00", 
    "2015-09-30 01:41:00", "2015-09-30 01:45:00", "2015-09-30 01:47:00", 
    "2015-09-30 01:49:00"), class = "factor"), ID = c(1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L), station = c(1L, 1L, 1L, 2L, 3, 
    4L, 4L, 4L, 4L, 5L, 5L, 6L, 
    6L, 5, 5, 5L, 7, 7, 7L, 
    7)), .Names = c("time", "ID", "station"), class = "data.frame", row.names = c(NA, 
    -20L))

I wish to evaluate rows within the time column conditional on the ID and station column. Specifically, I would like the function (dplyr?) to evaluate each time row, and compare the time to the previous time (row-1) and next time (row+1). If the time of current row is within 1 hour of time of previous and/or next row, and the ID and station of current row match that of previous and/or next row, then I would like to add in a new row a 1, otherwise a 0.

How would I achieve this using dplyr?

The expected outcome should be like this:

                  time ID station new.value
1  2015-03-25 12:24:00  1       1         0
2  2015-03-25 21:08:00  1       1         1
3  2015-03-25 21:13:00  1       1         1
4  2015-03-25 21:47:00  1       2         0
5  2015-03-26 03:08:00  1       3         0
6  2015-04-01 20:30:00  1       4         1
7  2015-04-01 20:34:00  1       4         1
8  2015-04-01 20:42:00  1       4         1
9  2015-04-01 20:45:00  1       4         1
10 2015-09-29 18:26:00  2       5         1
11 2015-09-29 19:11:00  2       5         1
12 2015-09-29 21:21:00  2       6         1
13 2015-09-29 22:03:00  2       6         1
14 2015-09-29 22:38:00  2       5         0
15 2015-09-30 00:48:00  2       5         1
16 2015-09-30 01:38:00  2       5         1
17 2015-09-30 01:41:00  2       7         1
18 2015-09-30 01:45:00  2       7         1
19 2015-09-30 01:47:00  2       7         1
20 2015-09-30 01:49:00  2       7         1

Upvotes: 4

Views: 946

Answers (3)

989
989

Reputation: 12937

Another solution using R base functions (sapply and difftime):

n=nrow(dummy)
dummy$new.value=
as.numeric(sapply(1:n, function(i) 
(i<n && (dummy[i,"ID"]==dummy[i+1,"ID"] && dummy[i,"station"]==dummy[i+1,"station"]) 
&& abs(as.numeric(difftime(dummy[i,"time"], dummy[i+1,"time"]), "hours"))<=1) 
|| 
(i>1 && (dummy[i,"ID"]==dummy[i-1,"ID"] && dummy[i,"station"]==dummy[i-1,"station"]) 
&& abs(as.numeric(difftime(dummy[i,"time"], dummy[i-1,"time"]), "hours"))<=1) 
))

# > dummy
                  # time ID station new.value
# 1  2015-03-25 12:24:00  1       1         0
# 2  2015-03-25 21:08:00  1       1         1
# 3  2015-03-25 21:13:00  1       1         1
# 4  2015-03-25 21:47:00  1       2         0
# 5  2015-03-26 03:08:00  1       3         0
# 6  2015-04-01 20:30:00  1       4         1
# 7  2015-04-01 20:34:00  1       4         1
# 8  2015-04-01 20:42:00  1       4         1
# 9  2015-04-01 20:45:00  1       4         1
# 10 2015-09-29 18:26:00  2       5         1
# 11 2015-09-29 19:11:00  2       5         1
# 12 2015-09-29 21:21:00  2       6         1
# 13 2015-09-29 22:03:00  2       6         1
# 14 2015-09-29 22:38:00  2       5         0
# 15 2015-09-30 00:48:00  2       5         1
# 16 2015-09-30 01:38:00  2       5         1
# 17 2015-09-30 01:41:00  2       7         1
# 18 2015-09-30 01:45:00  2       7         1
# 19 2015-09-30 01:47:00  2       7         1
# 20 2015-09-30 01:49:00  2       7         1

Upvotes: 3

akuiper
akuiper

Reputation: 214977

Here is an option using the difftime with dplyr mutate function. Firstly, we use a group_by operation to make sure the comparison is within each unique combination of ID and Station. The difftime can be used to calculate the difference time, here the units will be set as hours for convenience. The lag and lead functions are also from dplyr package which shift the selected column backward or forward. Combining with the vectorised operation of difftime, you can calculate the time difference between the current row and the previous/next row. We use abs to make sure the result is absolute value. The condition of <1 make sure the difference is within an hour. as.integer convert the logical values (T or F) to (1 or 0) correspondingly.

library(dplyr)
dummy %>% group_by(ID, station) %>% 
          mutate(new.value = as.integer(
                 abs(difftime(time, lag(time, default = Inf), units = "hours")) < 1 | 
                 abs(difftime(time, lead(time, default = Inf), units = "hours")) < 1))

Source: local data frame [20 x 4]
Groups: ID, station [7]

                  time    ID station new.value
                (time) (int)   (dbl)     (int)
1  2015-03-25 12:24:00     1       1         0
2  2015-03-25 21:08:00     1       1         1
3  2015-03-25 21:13:00     1       1         1
4  2015-03-25 21:47:00     1       2         0
5  2015-03-26 03:08:00     1       3         0
6  2015-04-01 20:30:00     1       4         1
7  2015-04-01 20:34:00     1       4         1
8  2015-04-01 20:42:00     1       4         1
9  2015-04-01 20:45:00     1       4         1
10 2015-09-29 18:26:00     2       5         1
11 2015-09-29 19:11:00     2       5         1
12 2015-09-29 21:21:00     2       6         1
13 2015-09-29 22:03:00     2       6         1
14 2015-09-29 22:38:00     2       5         0
15 2015-09-30 00:48:00     2       5         1
16 2015-09-30 01:38:00     2       5         1
17 2015-09-30 01:41:00     2       7         1
18 2015-09-30 01:45:00     2       7         1
19 2015-09-30 01:47:00     2       7         1
20 2015-09-30 01:49:00     2       7         1

Upvotes: 5

C8H10N4O2
C8H10N4O2

Reputation: 19005

Psidom's answer is great -- here's a data.table approach.

library(data.table)
setDT(dummy)
# you do NOT want a factor for your time variable
dummy[, time := as.POSIXct(time) ]
dummy[, `:=`(lag_diff = c(Inf, diff(as.numeric(time))),
             lead_diff = c(diff(as.numeric(time)), Inf)),
      by = .(ID, station) ]
dummy[, new.value := as.integer(lag_diff < 3600 | lead_diff < 3600) ]
dummy

Upvotes: 5

Related Questions