Reputation: 23
This is about athletes competing in the olympia. I'm supposed to calculate the top 10 athletes that held a medal for the longest time.
for example: won in 2004, 2008, 2012 --> therefore the athlete won 3 times in a row.
I'm just learning about R and I'm loosing my mind over it.
I don't even know where to start with solving this problem.
My data is "cleaned" as far as possible: - only athletes that won a gold medal - get the actual year they've won out of string
My columns (after cleaning)
id name team year medal
1 john doe USA 2004 gold
1 john doe USA 2008 gold
1 john doe USA 2012 gold
2 marc twain GER 2016 gold
3 edgar poe FIN 2000 gold
3 edgar poe FIN 2008 gold
I've tried some things like:
mutate(won =
if_else(condition = year == year +4,
true = "won",
false = "lost"))
or something like
mutate(won =
if_else(
condition = (year + 4) == tmp_year,
true = "Following Year",
false = if_else(
condition = year == tmp_year,
true = "Actual year",
false = "No")))
Here I only get Actual Year and No as answer.
In the end, i want a table that shows me how many times an athelte won the gold medal in a row.
So for example data set it would be something like this:
id name won
1 john doe 3
2 marc twain 1
3 edgar poe 1
Edit: I'm not looking for a complete answer, more like inspiration: what functions could be interesting to look at.
Upvotes: 2
Views: 458
Reputation: 13135
Here is one option using cumsum
and dplyr::lead
with default equal to year+4 taking into account where a player can have more than one series of medals
library(dplyr)
df %>% group_by(id) %>%
mutate(flag=lead(year,default = last(year)+4)-year, won=cumsum(flag==4)) %>%
select(-flag) %>% slice(which.max(won))
# A tibble: 3 x 6
# Groups: id [3]
id name team year medal won
<int> <chr> <chr> <int> <chr> <int>
1 1 john doe USA 2012 gold 3
2 2 marc twain GER 2016 gold 1
3 3 edgar poe FIN 2008 gold 1
This can be done in a compact way with
df %>% group_by(id, name, team) %>%
mutate(yearlead = lead(year, default = year[n()]+4), yeardiff = yearlead - year) %>%
group_by( grp = rleid(case_when(yeardiff == 4 ~ as.integer(yeardiff), TRUE ~ row_number())), add = TRUE) %>%
summarise(n = n())
# A tibble: 4 x 5
# Groups: id, name, team [?]
id name team grp n
<int> <chr> <chr> <int> <int>
1 1 john doe USA 1 3
2 2 marc twain GER 1 1
3 3 edgar poe FIN 1 1
4 3 edgar poe FIN 2 1
Data (This data different from the OP dataset)
df <- structure(list(id = c(1L, 1L, 1L, 2L, 3L, 3L, 3L, 3L, 3L), name = c("john doe", "john doe", "john doe", "marc twain", "edgar poe", "edgar poe", "edgar poe", "edgar poe", "edgar poe"),
team = c("USA", "USA", "USA", "GER", "FIN", "FIN", "FIN", "FIN", "FIN"), year = c(2004L, 2008L, 2012L, 2016L, 2000L, 2008L, 2016L, 2020L, 2024L), medal = c("gold", "gold", "gold", "gold", "gold", "gold", "gold", "gold", "gold" )), class = "data.frame", row.names = c(NA, -9L))
Upvotes: 1
Reputation: 389225
Using dplyr
we can calculate the difference in winning years of gold medals using diff
for each name
, then group_by
name
and the difference and calculate the consecutive winnings.
library(dplyr)
df %>%
group_by(name) %>%
mutate(diff = c(4,diff(year))) %>%
group_by(name, diff) %>%
summarise(count = n()) %>%
select(-diff)
# name count
# <fct> <int>
#1 edgarpoe 1
#2 edgarpoe 1
#3 johndoe 3
#4 marctwain 1
Upvotes: 1