Reputation: 3021

using lead or lag from dplyr in combination with other variables

I have a dataframe:

                 Time   CardID    Data      Type
1  2018-01-01 10:44:35 10037479 PowerOn  STBEvent
2  2018-01-01 10:44:48 10037479    0401 UseRemote
3  2018-01-01 10:44:53 10037479    0301 UseRemote
4  2018-01-01 10:45:13 10037479    0401 UseRemote
5  2018-01-01 10:45:24 10037479    0301 UseRemote
6  2018-01-01 10:45:30 10037479    1415  LiveView
7  2018-01-01 10:45:37 10037479    0401 UseRemote
8  2018-01-01 11:08:01 10037479    1412  LiveView
9  2018-01-01 11:08:13 10037479    0401 UseRemote
10 2018-01-01 11:14:31 10037479    0301 UseRemote

structure(list(Time = structure(c(1514783675, 1514783688, 1514783693, 
1514783713, 1514783724, 1514783730, 1514783737, 1514785081, 1514785093, 
1514785471), class = c("POSIXct", "POSIXt")), CardID = c("10037479", 
"10037479", "10037479", "10037479", "10037479", "10037479", "10037479", 
"10037479", "10037479", "10037479"), Data = c("PowerOn", "0401", 
"0301", "0401", "0301", "1415", "0401", "1412", "0401", "0301"
), Type = c("STBEvent", "UseRemote", "UseRemote", "UseRemote", 
"UseRemote", "LiveView", "UseRemote", "LiveView", "UseRemote", 
"UseRemote")), .Names = c("Time", "CardID", "Data", "Type"), row.names = c(NA, 
10L), class = "data.frame")

I am using lead and lag functions from dplyr to fetch data points before and after a particular row. For example I am using this:

ae1 <- which(dplyr::lag(df$Data)=="1415")+1

This gets me row number 6 from the above data frame where the Type is equal to "LiveView". I understand by altering the integer at the end of the code, I could get the respective line from the data frame. My question is: can I use the same / similar function wherein I can fetch data the next "LiveView" - which is row number 8. I can very well do

ae1 <- which(dplyr::lag(df$Data)=="1415")+3

to get row number 8. But the next LiveView type can occur in any row other than 8. I am thinking something similar to

ae1 <- which(dplyr::lag(df$Data)=="1415")+nrow(where Type == next "LiveView")

Upvotes: 1

Answers (3)

Frank

Reputation: 66819

I am using lead and lag functions from dplyr to fetch data points before and after a particular row. [...] can I use the same / similar function wherein I can fetch data the next "LiveView" [?]

If for each instance of Data == 1415, you want to find the next row where CardID matches, Type matches, and Time is greater, then there's...

library(data.table)
setDT(df)

mdf = df[Data == "1415", .(CardID, Type, Time)]
w   = df[mdf, on=.(CardID, Type, Time > Time), mult="first", which=TRUE]
df[w]

#                   Time   CardID Data     Type
# 1: 2018-01-01 00:38:01 10037479 1412 LiveView

If you have duplicate times, then Time won't work as a row number. You can add a row number like df[, rn := .I] or df[, rn := rowid(CardID)] and use that instead.

The line with on= is a non-equi join, currently not available in dplyr, which is why I'm posting with a different package here.

If you want to return both rows....

w0 = df[Data == "1415", which=TRUE]
w  = df[df[w0], on=.(CardID, Type, Time > Time), mult="first", which=TRUE]
df[matrix(c(w0, w), 2, byrow=TRUE)]

#                   Time   CardID Data     Type
# 1: 2018-01-01 00:15:30 10037479 1415 LiveView
# 2: 2018-01-01 00:38:01 10037479 1412 LiveView

Or if you also want data points before the row:

wb = df[df[w0], on=.(CardID, Type, Time < Time), mult="first", which=TRUE]
df[matrix(c(wb, w0, w), 3, byrow=TRUE)]

#                   Time   CardID Data     Type
# 1:                <NA>     <NA> <NA>     <NA>
# 2: 2018-01-01 00:15:30 10037479 1415 LiveView
# 3: 2018-01-01 00:38:01 10037479 1412 LiveView

NAs are shown since no row meets those criteria.

Upvotes: 1

G. Grothendieck

Reputation: 270248

1) If the objective is to find the row number of the first LiveWire row past the first 1415 row then use the conjunction shown followed by which and first to get the row numbers and the first of those row numbers. Note that cummany is true for the first Data component of 1415 onwards and by lagging it we get true only for those after it to the end. If we knew that the there were only one such row then we could omit the first. Since dplyr's lag conflicts with lag in the base we use dplyr::lag to be sure we are using the desired one.

df %>% 
  { dplyr::lag(cumany(.$Data == 1415)) & .$Type == "LiveView" } %>%
  which %>%
  first
  ## [1] 8

2) If we wanted the row itself instead then use filter and slice. If we knew there were only one such row we could omit the slice:

df %>% 
  filter(dplyr::lag(cumany(Data == 1415)) & Type == "LiveView") %>%
  slice(1)
##                  Time   CardID Data     Type
## 1 2018-01-01 00:38:01 10037479 1412 LiveView

Note that if we added a row number to df by replacing the first line of code with:

df %>% mutate(n = 1:n()) %>%

then the above code would also give the row number in the n column in addition to the row itself in the other columns.

2a) An alternative to (2) is that we could first filter by cumany(Data == 1415) to give all rows from the first 1415 row onwards and then remove the first row since we only want rows after it and then find the LiveView rows within that and take the first.

df %>% 
  filter(cumany(Data == 1415)) %>%
  slice(-1) %>%
  filter(Type == "LiveView") %>%
  slice(1)
##                  Time   CardID Data     Type
## 1 2018-01-01 00:38:01 10037479 1412 LiveView

Update

Revised.

Upvotes: 1

Jonny

Reputation: 2793

One way to do this would be to dplyr::group_by the data based on Type, dplyr::filter the Type of interest, and then dplyr::slice to the position you want, in this case, position 2:

library(dplyr)

df <- 
  structure(
    list(
      Time = 
        structure(c(1514783675, 1514783688, 1514783693, 
                    1514783713, 1514783724, 1514783730, 1514783737, 1514785081, 1514785093, 
                    1514785471), class = c("POSIXct", "POSIXt")), 
      CardID = c("10037479", "10037479", "10037479", "10037479", "10037479", "10037479", "10037479", 
                 "10037479", "10037479", "10037479"), 
      Data = c("PowerOn", "0401", "0301", "0401", "0301", "1415", "0401", "1412", "0401", "0301"), 
      Type = c("STBEvent", "UseRemote", "UseRemote", "UseRemote", 
               "UseRemote", "LiveView", "UseRemote", "LiveView", "UseRemote", 
               "UseRemote")), 
    .Names = c("Time", "CardID", "Data", "Type"), 
    row.names = c(NA, 10L), 
    class = "data.frame")


df %>% 
group_by(Type) %>% 
filter(Type %in% 'LiveView') %>% 
slice(2)

Upvotes: 0

using lead or lag from dplyr in combination with other variables

Answers (3)

Update

Related Questions