Reputation: 175
I have a data frame that looks like this:
state start end date treat
1 1999 2000 2001 1
1 1998 2000 2001 1
1 2000 2003 NA 0
2 2001 2002 NA 0
2 2002 2004 2003 1
2 2003 2004 2005 1
3 2002 2004 2006 1
3 2003 2004 NA 0
3 2005 2007 NA 0
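For reproducibility, the table above could be built as follows. This is only a sketch: the object name `x` is an assumption (it matches the name used in the first answer), and the numeric column types are a guess, since the question does not show them.

```r
# Example data from the table above; `x` is an assumed name
x <- data.frame(
  state = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  start = c(1999, 1998, 2000, 2001, 2002, 2003, 2002, 2003, 2005),
  end   = c(2000, 2000, 2003, 2002, 2004, 2004, 2004, 2004, 2007),
  date  = c(2001, 2001, NA, NA, 2003, 2005, 2006, NA, NA),
  treat = c(1, 1, 0, 0, 1, 1, 1, 0, 0)
)
```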
I want to group it by the state identifier and, for each row, count the treated observations (treat = 1) within the same state whose date lies between that row's start and end.
In other words, I want to get the following:
state start end date treat result
1 1999 2000 2001 1 0
1 1998 2000 2001 1 0
1 2000 2003 NA 0 2
2 2001 2002 NA 0 0
2 2002 2004 2003 1 1
2 2003 2004 2005 1 0
3 2002 2004 2006 1 0
3 2003 2004 NA 0 0
3 2005 2007 NA 0 1
For instance, result in the first row is 0 because within state 1 there is no date between 1999 and 2000. On the other hand, result in the last row is 1 because within state 3 there is one treated unit whose date lies between 2005 and 2007 (namely date = 2006 in the 7th row).
Thank you very much for your help.
Upvotes: 0
Views: 135
Reputation: 39657
You can split by state and combine two outer calls with &, testing whether date is between start and end, and then sum treat for those matching dates.
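# Assumes x is sorted by state, so the per-state results line up with the rows of x
x$result <- unlist(lapply(split(x, x$state), function(y) {
  # tt[i, j] is TRUE when date j lies strictly between start i and end i
  tt <- outer(y$start, y$date, "<") & outer(y$end, y$date, ">")
  # A missing date cannot fall inside any interval
  tt[is.na(tt)] <- FALSE
  # For each row, sum treat over the matching dates
  apply(tt, 1, function(z) sum(y$treat[z]))
}))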
x
# state start end date treat result
#1 1 1999 2000 2001 1 0
#2 1 1998 2000 2001 1 0
#3 1 2000 2003 NA 0 2
#4 2 2001 2002 NA 0 0
#5 2 2002 2004 2003 1 1
#6 2 2003 2004 2005 1 0
#7 3 2002 2004 2006 1 0
#8 3 2003 2004 NA 0 0
#9 3 2005 2007 NA 0 1
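To see what the outer step does, here is a small sketch restricted to state 1 of the example data. The variable names (`inside`, `result`) are illustrative only, and missing dates are treated as non-matching, which is an assumption about the intended behaviour.

```r
# State 1 from the example data
start <- c(1999, 1998, 2000)
end   <- c(2000, 2000, 2003)
date  <- c(2001, 2001, NA)
treat <- c(1, 1, 0)

# inside[i, j]: does date j lie strictly between start i and end i?
inside <- outer(start, date, "<") & outer(end, date, ">")
# Comparisons with a missing date give NA; count them as "not inside"
inside[is.na(inside)] <- FALSE

# For each interval (row), sum treat over the dates that fall inside it
result <- apply(inside, 1, function(z) sum(treat[z]))
result
# 0 0 2
```

Each row of `inside` corresponds to one (start, end) interval and each column to one date, so the row-wise sum gives one count per original row.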
Or you can take the part describing treat per state and date, merge it with the part describing state, start and end, and sum the matching treat.
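# Total treat per state and date (rows with NA date are dropped by the formula interface)
tt <- aggregate(treat ~ state + date, x[,c("state", "date", "treat")], sum)
# Pair every (start, end) interval with every date of the same state
tt <- merge(x[,c("state", "start", "end")], tt)
# Zero out dates that fall outside the interval
tt$treat[tt$start >= tt$date | tt$end <= tt$date] <- 0
# Sum the remaining treat per interval
aggregate(treat ~ start + end + state, tt, sum)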
# start end state treat
#1 1998 2000 1 0
#2 1999 2000 1 0
#3 2000 2003 1 2
#4 2001 2002 2 0
#5 2002 2004 2 1
#6 2003 2004 2 0
#7 2002 2004 3 0
#8 2003 2004 3 0
#9 2005 2007 3 1
Upvotes: 2
Reputation: 3388
This counts, per state, the treated observations whose date falls between the state's earliest start and latest end, and repeats that count on every row:
library(tidyverse)
df %>% group_by(state) %>%
  mutate(result = sum(treat == 1 & date >= min(start, na.rm = TRUE) &
                        date <= max(end, na.rm = TRUE), na.rm = TRUE))
#> # A tibble: 9 x 6
#> # Groups: state [3]
#> state start end date treat result
#> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 1 1999 2000 2001 1 2
#> 2 1 1998 2000 2001 1 2
#> 3 1 2000 2003 NA 0 2
#> 4 2 2001 2002 NA 0 1
#> 5 2 2002 2004 2003 1 1
#> 6 2 2003 2004 2005 1 1
#> 7 3 2002 2004 2006 1 1
#> 8 3 2003 2004 NA 0 1
#> 9 3 2005 2007 NA 0 1
If you just want one number per group, summarize might be a better option:
df %>% group_by(state) %>%
  summarize(result = sum(treat == 1 & date >= min(start, na.rm = TRUE) &
                           date <= max(end, na.rm = TRUE), na.rm = TRUE))
#> # A tibble: 3 x 2
#> state result
#> <dbl> <int>
#> 1 1 2
#> 2 2 1
#> 3 3 1
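If the per-row counts from the question are wanted instead of one total per state, a possible variant is sketched below. It assumes dplyr is installed and that "between" means strictly between, as in the outer-based answer. The mapply helper and the name `res` are my own illustration, not part of the code above.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  state = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  start = c(1999, 1998, 2000, 2001, 2002, 2003, 2002, 2003, 2005),
  end   = c(2000, 2000, 2003, 2002, 2004, 2004, 2004, 2004, 2007),
  date  = c(2001, 2001, NA, NA, 2003, 2005, 2006, NA, NA),
  treat = c(1, 1, 0, 0, 1, 1, 1, 0, 0)
)

res <- df %>%
  group_by(state) %>%
  # For each row's (start, end), count treated dates of the same state inside it
  mutate(result = mapply(
    function(s, e) sum(treat == 1 & date > s & date < e, na.rm = TRUE),
    start, end
  )) %>%
  ungroup()

res$result
# 0 0 2 0 1 0 0 0 1
```

Inside the grouped mutate, `treat` and `date` refer to the whole group, while `s` and `e` are the current row's bounds, so each row is compared against every date in its state.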
Upvotes: 0