Marco Mello

Reputation: 175

Count observations per group satisfying a different condition for each row

I have a dataframe that looks like this one

state start end  date treat
1     1999  2000 2001 1
1     1998  2000 2001 1
1     2000  2003 NA   0
2     2001  2002 NA   0
2     2002  2004 2003 1
2     2003  2004 2005 1
3     2002  2004 2006 1
3     2003  2004 NA   0
3     2005  2007 NA   0

I want to group it by the state identifier and, for each row, compute the number of treated observations (treat = 1) within the same state whose date lies between that row's start and end.

In other words I want to get the following

state start end  date treat result
1     1999  2000 2001 1     0
1     1998  2000 2001 1     0
1     2000  2003 NA   0     2
2     2001  2002 NA   0     0
2     2002  2004 2003 1     1
2     2003  2004 2005 1     0
3     2002  2004 2006 1     0
3     2003  2004 NA   0     0
3     2005  2007 NA   0     1

For instance, result in the first row is equal to 0 because within state = 1 there is no date between 1999 and 2000. On the other hand, result in the last row is equal to 1 because within state = 3 there is one treated unit whose date lies between 2005 and 2007 (namely date = 2006 in the 7th row).

Thank you very much for your help.

Upvotes: 0

Views: 135

Answers (2)

GKi

Reputation: 39657

You can split by state, combine two outer comparisons with & to test whether each date lies between start and end, and then sum treat over the matching dates.

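x$result  <- unlist(lapply(split(x, x$state), function(y) {
  # tt[i, j] is TRUE when date j lies strictly between start i and end i
  tt  <- outer(y$start, y$date, "<") & outer(y$end, y$date, ">")
  # NA dates give NA; TRUE is harmless here because those rows have treat == 0
  tt[is.na(tt)]  <- TRUE
  # for each row, add up treat over the matching dates
  apply(tt, 1, function(z) sum(y$treat[z]))
}))
# note: the assignment assumes x is already sorted by state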
x
#  state start  end date treat result
#1     1  1999 2000 2001     1      0
#2     1  1998 2000 2001     1      0
#3     1  2000 2003   NA     0      2
#4     2  2001 2002   NA     0      0
#5     2  2002 2004 2003     1      1
#6     2  2003 2004 2005     1      0
#7     3  2002 2004 2006     1      0
#8     3  2003 2004   NA     0      0
#9     3  2005 2007   NA     0      1
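
If the rows are not sorted by state, the same idea can be written so that the counts are put back by position. The following is only a sketch: the helper name count_in_window is made up for illustration, the data frame is retyped from the question, and the strict inequalities are an assumption chosen to reproduce the expected result column.

```r
# Example data retyped from the question
x <- data.frame(
  state = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  start = c(1999, 1998, 2000, 2001, 2002, 2003, 2002, 2003, 2005),
  end   = c(2000, 2000, 2003, 2002, 2004, 2004, 2004, 2004, 2007),
  date  = c(2001, 2001, NA, NA, 2003, 2005, 2006, NA, NA),
  treat = c(1, 1, 0, 0, 1, 1, 1, 0, 0)
)

# Hypothetical helper: for every row of one state, sum treat over the
# rows of that state whose (non-missing) date lies strictly inside the
# row's (start, end) window
count_in_window <- function(y) {
  vapply(seq_len(nrow(y)), function(i) {
    inside <- !is.na(y$date) & y$date > y$start[i] & y$date < y$end[i]
    sum(y$treat[inside])
  }, numeric(1))
}

# unsplit() reverses split(), so each count returns to its original row
x$result <- unsplit(lapply(split(x, x$state), count_in_window), x$state)
x
```

Here missing dates are excluded explicitly instead of relying on treat being 0 for those rows.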

Or you can sum treat per state and date, merge that with the state, start and end columns, set treat to 0 wherever the date falls outside the interval, and sum what remains per interval.

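# total treat per state and date (rows with NA date are dropped)
tt <- aggregate(treat ~ state + date, x[,c("state", "date", "treat")], sum)
# pair every (start, end) interval with every date of the same state
tt <- merge(x[,c("state", "start", "end")], tt)
# zero out dates that are not strictly inside the interval
tt$treat[tt$start >= tt$date | tt$end <= tt$date]  <- 0
aggregate(treat ~ start + end + state, tt, sum)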
#  start  end state treat
#1  1998 2000     1     0
#2  1999 2000     1     0
#3  2000 2003     1     2
#4  2001 2002     2     0
#5  2002 2004     2     1
#6  2003 2004     2     0
#7  2002 2004     3     0
#8  2003 2004     3     0
#9  2005 2007     3     1
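
To get these counts back next to the original columns, the aggregated table could be merged onto x again. This is a rough sketch, not part of the approach above: the names res and out are made up, the data frame is retyped from the question, and it assumes each combination of state, start and end occurs only once (duplicated intervals would be counted more than once).

```r
# Example data retyped from the question
x <- data.frame(
  state = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  start = c(1999, 1998, 2000, 2001, 2002, 2003, 2002, 2003, 2005),
  end   = c(2000, 2000, 2003, 2002, 2004, 2004, 2004, 2004, 2007),
  date  = c(2001, 2001, NA, NA, 2003, 2005, 2006, NA, NA),
  treat = c(1, 1, 0, 0, 1, 1, 1, 0, 0)
)

# Same steps as the merge approach
tt <- aggregate(treat ~ state + date, x[, c("state", "date", "treat")], sum)
tt <- merge(x[, c("state", "start", "end")], tt)
tt$treat[tt$start >= tt$date | tt$end <= tt$date] <- 0
res <- aggregate(treat ~ start + end + state, tt, sum)
names(res)[names(res) == "treat"] <- "result"

# Attach the counts to the original rows; all.x = TRUE keeps states
# that have no dated rows at all, which then get a count of 0
out <- merge(x, res, by = c("state", "start", "end"), all.x = TRUE)
out$result[is.na(out$result)] <- 0

# merge() does not promise the original row order, so sort explicitly
out <- out[order(out$state, out$start, out$end), ]
out
```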

Upvotes: 2

Kent Johnson

Reputation: 3388

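This counts, for each state, the treated rows whose date falls within that state's overall range (earliest start to latest end), and repeats that count on every row: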

library(tidyverse)

df %>% group_by(state) %>% 
  mutate(result=sum(treat==1 & date>=min(start, na.rm=TRUE) & date<=max(end, na.rm=TRUE), na.rm=TRUE))
#> # A tibble: 9 x 6
#> # Groups:   state [3]
#>   state start   end  date treat result
#>   <dbl> <dbl> <dbl> <dbl> <dbl>  <int>
#> 1     1  1999  2000  2001     1      2
#> 2     1  1998  2000  2001     1      2
#> 3     1  2000  2003    NA     0      2
#> 4     2  2001  2002    NA     0      1
#> 5     2  2002  2004  2003     1      1
#> 6     2  2003  2004  2005     1      1
#> 7     3  2002  2004  2006     1      1
#> 8     3  2003  2004    NA     0      1
#> 9     3  2005  2007    NA     0      1

If you just want one number per group, summarize might be a better option:

df %>% group_by(state) %>% 
  summarize(result=sum(treat==1 & date>=min(start, na.rm=TRUE) & date<=max(end, na.rm=TRUE), na.rm=TRUE))
#> # A tibble: 3 x 2
#>   state result
#>   <dbl>  <int>
#> 1     1      2
#> 2     2      1
#> 3     3      1
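
If a separate count for each row's own start and end is wanted instead, one possible sketch is to loop over the rows of each group inside mutate. The data frame below is retyped from the question, the name out is made up, and the strict inequalities are an assumption: with >= the sixth row would count its own date of 2003, which the expected table in the question does not do.

```r
library(dplyr)

# Example data retyped from the question
df <- data.frame(
  state = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  start = c(1999, 1998, 2000, 2001, 2002, 2003, 2002, 2003, 2005),
  end   = c(2000, 2000, 2003, 2002, 2004, 2004, 2004, 2004, 2007),
  date  = c(2001, 2001, NA, NA, 2003, 2005, 2006, NA, NA),
  treat = c(1, 1, 0, 0, 1, 1, 1, 0, 0)
)

out <- df %>%
  group_by(state) %>%
  mutate(result = vapply(
    seq_along(start),
    # for row i of the group, sum treat over the group's rows whose
    # non-missing date lies strictly between start[i] and end[i]
    function(i) sum(treat[!is.na(date) & date > start[i] & date < end[i]]),
    numeric(1)
  )) %>%
  ungroup()
out
```

The work grows with the square of the group size, which should be fine for small groups like these.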

Upvotes: 0
