Reputation: 5
Let's assume I have a dataset similar to this:
library(tidyverse)
country <- c(rep("Germany", 9), rep("Greece", 9), rep("Turkey", 9), rep("Austria", 9))
date <- rep(seq(ymd('2012-04-07'),ymd('2012-04-15'),by='days'), 4)
gyros_eaten <- c(lag(1:3, n = 2, default = 0), floor(runif(6, min=1, max=4)),
lag(1:6, n = 5, default = 0), floor(runif(3, min=1, max=4)),
lag(1:2, n = 1, default = 0), floor(runif(7, min=1, max=4)),
lag(1:3, n = 2, default = 0), floor(runif(6, min=1, max=4)))
df <- data.frame(country, date, gyros_eaten)
I want to create a new column, name it days_since_first_local_gyros
, which would tell me how many days have passed since eating the first gyros, in that particular country. That is, it would take the value of zero if no gyros was eaten yet, or if on that day I ate my first gyros in e.g. Greece. After that, it would run from 1 to infinity, increasing by 1 on each day. Hence each group (i.e. country) would have a distinct index, and I have rows for each possible country x day dyads in my dataset, within a particular time range.
How should I create this column? I assume it's some combination of case when
, group_by
and lag
, but i just can't wrap my head around it.
Upvotes: 0
Views: 259
Reputation: 388807
We could get the first date
where gyros_eaten > 0
and subtract all the dates from that date in the country
. We use pmax
to keep max value of 0 in days_since_first_local_gyros
.
library(dplyr)
df %>%
arrange(country, date) %>%
group_by(country) %>%
mutate(days_since_first_local_gyros = pmax(as.integer(date -
date[which.max(gyros_eaten > 0)]), 0))
# country date gyros_eaten days_since_first_local_gyros
# <chr> <date> <dbl> <dbl>
# 1 Austria 2012-04-07 0 0
# 2 Austria 2012-04-08 0 0
# 3 Austria 2012-04-09 1 0
# 4 Austria 2012-04-10 1 1
# 5 Austria 2012-04-11 1 2
# 6 Austria 2012-04-12 2 3
# 7 Austria 2012-04-13 2 4
# 8 Austria 2012-04-14 2 5
# 9 Austria 2012-04-15 3 6
#10 Germany 2012-04-07 0 0
# … with 26 more rows
Upvotes: 1