RUser
RUser

Reputation: 598

Consecutive event analysis in R

I need help with this problem please. I have checked various other posts, but I cannot piece it together. I have data, about 100,000 records of athletes and training events they participated in. I simplified the data, but the approach would suit the entire data set.

Code for the data.frame:

# Fictitious data
days <- seq(as.Date("2016/01/01"), as.Date("2016/01/28"), "days")
events <- c("Run","Swim","Swim","Cycle","Rest","Gym","Swim","Run",
  "Cycle","Run","Swim","Swim","Run","Swim","Cycle","Rest","Gym",
  "Swim","Swim","Swim","Run","Swim","Run","Gym","Rest","Gym",
  "Cycle","Swim")
my.data <- data.frame(athlete = 1, days,events)
# Note - This data repeats for many participants, but I did not include more than 1

I need to flag athletes who completed at least 3 swim events per week for at least 2 consecutive weeks.

Edit: I did not think this through properly. Let's make this a bit more complicated. Let's say we work with running weeks, i.e. a group of 7 days and not calendar weeks, starting from the very first swim event for each athlete.

Update: I have yet another challenge, let's say I want to just look for a pattern of 3 swim events per 5 days for at least 10 consecutive days, anywhere in the data.

Thanks

Upvotes: 1

Views: 485

Answers (2)

akuiper
akuiper

Reputation: 215117

You can do a two-step summarization, firstly calculate the number of swims for each week for each athlete and then check if there is any consecutive weeks that have more then three swims for the athlete:

library(dplyr)
library(lubridate)
my.data %>% 
      arrange(days) %>% 
      group_by(athlete, w = week(days)) %>% 
      summarise(n_swim = sum(events == "Swim")) %>% 
      group_by(athlete) %>% 
      summarise(flag = any(diff(w[n_swim >= 3]) == 1))

# A tibble: 1 x 2
#  athlete  flag
#    <dbl> <lgl>
#1       1  TRUE

Update: to set the week starting from the first Swim, use which.max() to find out the index where the first Swim appears and then subtract all days by this day to get day difference, then if you do modulo(7) calculation, the week number will start from this day:

my.data %>% 
        arrange(days) %>% group_by(athlete) %>% 
        mutate(Swim = events == "Swim", 
               w = as.integer(days - days[which.max(Swim)]) %/% 7) %>%
        # the first swim day is set as zero, a modulo of 7 will give week number 
        # starting from this day            

        group_by(w, add = TRUE) %>% 
        summarise(n_swim = sum(Swim)) %>% 
        group_by(athlete) %>% 
        summarise(flag = any(diff(w[n_swim >= 3]) == 1))

# A tibble: 1 x 2
#  athlete  flag
#    <dbl> <lgl>
#1       1  TRUE

Upvotes: 4

Rahul
Rahul

Reputation: 2779

Quick and dirty code, but check if it works for your dataset:

library(tidyverse)
library(lubridate)

df %>% 
    mutate(weeknum=week(days)) %>% 
    group_by(athlete,weeknum) %>% 
    filter(events=='Swim') %>% 
    summarise(n=n()) %>% 
    mutate(gt_3=as.numeric(n>=3),
           x=gt_3-lag(gt_3,1),
           flag=x==0) %>% 
    filter(flag==T) %>% 
    select(athlete) %>% 
    distinct()

Upvotes: 1

Related Questions