Reputation: 598
I need help with this problem please. I have checked various other posts, but I cannot piece it together. I have data, about 100,000 records of athletes and training events they participated in. I simplified the data, but the approach would suit the entire data set.
Code for the data.frame:
# Fictitious data
days <- seq(as.Date("2016/01/01"), as.Date("2016/01/28"), "days")
events <- c("Run","Swim","Swim","Cycle","Rest","Gym","Swim","Run",
"Cycle","Run","Swim","Swim","Run","Swim","Cycle","Rest","Gym",
"Swim","Swim","Swim","Run","Swim","Run","Gym","Rest","Gym",
"Cycle","Swim")
my.data <- data.frame(athlete = 1, days,events)
# Note - This data repeats for many participants, but I did not include more than 1
I need to flag athletes who completed at least 3 swim events per week for at least 2 consecutive weeks.
Edit: I did not think this through properly. Let's make this a bit more complicated. Let's say we work with running weeks, i.e. a group of 7 days and not calendar weeks, starting from the very first swim event for each athlete.
Update: I have yet another challenge, let's say I want to just look for a pattern of 3 swim events per 5 days for at least 10 consecutive days, anywhere in the data.
Thanks
Upvotes: 1
Views: 485
Reputation: 215117
You can do a two-step summarization, firstly calculate the number of swims for each week for each athlete and then check if there is any consecutive weeks that have more then three swims for the athlete:
library(dplyr)
library(lubridate)
my.data %>%
arrange(days) %>%
group_by(athlete, w = week(days)) %>%
summarise(n_swim = sum(events == "Swim")) %>%
group_by(athlete) %>%
summarise(flag = any(diff(w[n_swim >= 3]) == 1))
# A tibble: 1 x 2
# athlete flag
# <dbl> <lgl>
#1 1 TRUE
Update: to set the week starting from the first Swim, use which.max()
to find out the index where the first Swim
appears and then subtract all days by this day to get day difference, then if you do modulo(7) calculation, the week number will start from this day:
my.data %>%
arrange(days) %>% group_by(athlete) %>%
mutate(Swim = events == "Swim",
w = as.integer(days - days[which.max(Swim)]) %/% 7) %>%
# the first swim day is set as zero, a modulo of 7 will give week number
# starting from this day
group_by(w, add = TRUE) %>%
summarise(n_swim = sum(Swim)) %>%
group_by(athlete) %>%
summarise(flag = any(diff(w[n_swim >= 3]) == 1))
# A tibble: 1 x 2
# athlete flag
# <dbl> <lgl>
#1 1 TRUE
Upvotes: 4
Reputation: 2779
Quick and dirty code, but check if it works for your dataset:
library(tidyverse)
library(lubridate)
df %>%
mutate(weeknum=week(days)) %>%
group_by(athlete,weeknum) %>%
filter(events=='Swim') %>%
summarise(n=n()) %>%
mutate(gt_3=as.numeric(n>=3),
x=gt_3-lag(gt_3,1),
flag=x==0) %>%
filter(flag==T) %>%
select(athlete) %>%
distinct()
Upvotes: 1