svnnf
svnnf

Reputation: 35

Calculate difference between two values in grouped sequences

This is a follow-up question for this post: Loop through dataframe in R and measure time difference between two values

I already got excellent help with the following code to calculate the time difference in minutes between a certain Stimuli and the next Response:

df$Date <- as.POSIXct(strptime(df$Date,"%d.%m.%Y %H:%M"))
df %>%
  arrange(User,Date)%>%
  mutate(difftime= difftime(lead(Date),Date, units = "mins") ) %>%
  group_by(User)%>%
  filter((StimuliA==1 | StimuliB==1) & lead(Responses)==1)`

Dataset:

structure(list(User = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 
4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L), Date = c("24.11.2015 20:39", 
"25.11.2015 11:42", "11.01.2016 22:46", "26.11.2015 22:42", "04.03.2016 05:45", 
"24.11.2015 13:13", "25.11.2015 13:59", "27.11.2015 12:18", "28.05.2016 06:49", 
"06.07.2016 09:46", "03.12.2015 09:32", "07.12.2015 08:18", "08.12.2015 19:40", 
"08.12.2015 19:40", "22.12.2015 08:50", "22.12.2015 08:52", "22.12.2015 08:52", 
"22.12.2015 20:46"), StimuliA = c(1L, 0L, 0L, 1L, 1L, 1L, 0L, 
1L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L), StimuliB = c(0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 
0L), Responses = c(0L, 1L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 1L, 0L, 
0L, 1L, 0L, 1L, 1L, 1L, 1L)), .Names = c("User", "Date", "StimuliA", 
"StimuliB", "Responses"), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -18L), spec = structure(list(cols = structure(list(
    User = structure(list(), class = c("collector_integer", "collector"
    )), Date = structure(list(), class = c("collector_character", 
    "collector")), StimuliA = structure(list(), class = c("collector_integer", 
    "collector")), StimuliB = structure(list(), class = c("collector_integer", 
    "collector")), Responses = structure(list(), class = c("collector_integer", 
    "collector"))), .Names = c("User", "Date", "StimuliA", "StimuliB", 
"Responses")), default = structure(list(), class = c("collector_guess", 
"collector"))), .Names = c("cols", "default"), class = "col_spec"))

Goal/Question The lead arugment helped to determine the time difference between a Stimuli == 1 (A or B) and the next response [sorted by date/time] (Response == 1). How would i change that code to find the time difference between the Stimuli A or B and the LAST Response in this sequence. (until the next Stimuli occurs)

Desired output:

User    Date        StimuliA    StimuliB    Responses   time diff  Seq_ID
1   24.11.2015 20:39    1           0           0                  1_1_0
1   25.11.2015 11:42    0           0           1                  1_1_0
1   11.01.2016 22:46    0           0           1       69247      1_1_0
2   26.11.2015 22:42    1           0           0                  2_1_0
2   04.03.2016 05:45    0           1           0                  2_1_1
3   24.11.2015 13:13    1           0           0                  3_1_0
3   25.11.2015 13:59    0           0           1       1486       3_1_0
3   27.11.2015 12:18    1           0           0                  3_2_0
3   28.05.2016 06:49    0           0           1                  3_2_0
3   06.07.2016 09:46    0           0           1       319528     3_2_0
4   03.12.2015 09:32    1           0           0                  4_1_0
4   07.12.2015 08:18    1           0           0                  4_2_0
4   08.12.2015 19:40    0           0           1       2122       4_1_0
4   08.12.2015 19:40    0           1           0                  4_2_1
4   22.12.2015 08:50    0           0           1       19510      4_2_1
5   22.12.2015 08:52    0           0           1                  5_0_0
5   22.12.2015 08:52    0           0           1                  5_0_0
5   22.12.2015 20:46    0           0           1                  5_0_0

For Stimuli A this meant the values c(69247, 319528, 2122) and B c(1486, 19510).

Upvotes: 0

Views: 603

Answers (1)

mikeck
mikeck

Reputation: 3788

Try this.

# df$Date <- as.POSIXct(strptime(df$Date,"%d.%m.%Y %H:%M"))
df %>% 
  arrange(User, Date) %>% 
  group_by(User) %>%
  mutate(
    last.date = Date[which(StimuliA == 1L)[c(1,1:sum(StimuliA == 1L))][cumsum(StimuliA == 1L)+ 1]]
  ) %>%
  mutate(
    timesince = ifelse(Responses == 1L, Date - last.date, NA)
  )

This works by first creating a column that records the data of last stimuli, and then using ifelse and lag to get the difference between the current date and the last stimuli date. You can filter to extract only the LAST response.

There is a cleaner way to do the "last.date" operation with zoo.na.locf, but I didn't want to assume you were ok with another package dependency.

EDIT To identify the sequence (if I correctly understand what you mean by "sequence"), continue the chain with

%>% mutate(sequence = cumsum(StimuliA))

to identify sequences defined as observations following a positive Stimuli. To filter out the last response of a sequence, continue the chain with

%>% group_by(User, sequence) %>%
  filter(timesince == max(timesince, na.rm = TRUE))

to group by sequence (and user) and then extract the maximum time difference associated with each sequence (which will correspond to the last positive response of a sequence).

Upvotes: 2

Related Questions