Reputation: 793
I have the following data:
library(tidyverse)
set.seed(1)
test <- data.frame(id = c(rep(1, 3), rep(2, 4), rep(3, 5)),
Year = 2000 + c(1,3,5,2,3,5,6,1,2,3,4,5),
var1 = sample(0:2, replace = TRUE, size = 12, prob = c(0.6, 0.3, 0.1)),
var2 = sample(0:2, replace = TRUE, size = 12, prob = c(0.6, 0.3, 0.1)))
I need to the first year that each variable (var1
and var2
) is non-zero within each id group.
I know how to find the row number of the first non-zero row:
temp <- function(a) ifelse(length(head(which(a>0),1))==0,0,head(which(a>0),1))
test2 <- test %>% group_by(id) %>%
mutate_at(vars(var1:var2),funs(temp)) %>%
filter(row_number()==1) %>% select (-year)
id var1 var2
1 1 0 1
2 2 1 2
3 3 1 1
However, I am not sure how to match the row number back to the year variable so that I will know exactly when did the var1
and var2
turn non-zero, instead of only having the row numbers.
This is what I want:
id var1 var2
1 1 0 2001
2 2 2002 2003
3 3 2001 2001
Upvotes: 4
Views: 128
Reputation: 93813
A slightly different approach gathering everything into a big long file first:
test %>%
gather(var, value, var1:var2) %>%
filter(value != 0) %>%
group_by(id, var) %>%
summarise(Year = min(Year)) %>%
spread(var, Year)
## A tibble: 3 x 3
## Groups: id [3]
# id var1 var2
#* <dbl> <dbl> <dbl>
#1 1.00 NA 2001
#2 2.00 2002 2003
#3 3.00 2001 2001
And a base R version for fun:
tmp <- cbind(test[c("id", "Year")], stack(test[c("var1","var2")]))
tmp <- tmp[tmp$values != 0,]
tmp <- aggregate(Year ~ id + ind, data=tmp, FUN=min)
reshape(tmp[c("id","ind","Year")], idvar="id", timevar="ind", direction="wide")
Upvotes: 1
Reputation: 48211
We may do the following:
test %>% group_by(id) %>% summarise_at(vars(var1:var2), funs(Year[. > 0][1]))
# A tibble: 3 x 3
# id var1 var2
# <dbl> <dbl> <dbl>
# 1 1 NA 2001
# 2 2 2002 2003
# 3 3 2001 2001
That is, . > 0
gives a logical vector with TRUE
whenever a value is positive, then we select all the corresponding years, and lastly pick only the first one.
That's very similar to your approach. Notice that due to using summarise
I no longer need filter(row_number()==1) %>% select (-year)
. Also, my function corresponding to temp
is more concise.
Upvotes: 5