Reputation: 87
I've got a dataset that tracks the revenue of millions of companies over many years. The data looks like the simplified version below:
dat <- data.frame(Company = c("a","b","c","d","e","f"), rev_2001 = c(NA, 20, 10, NA, NA, 10),
rev_2002 = c(10, 50, 20, 30, NA, 0), rev_2003 = c(20, NA, 0, NA, NA, 30), rev_2004 = c(NA, 60, 0, 50, NA, 50), rev_2005 = c(NA, 30, NA, 0, NA, 60))
I wanted to create a variable (a new column that we could label 'closure.year') that would capture the last year that the company had a revenue different from 0 or NA. I'm struggling with two things:
In sum, I would like to have final data that would look like:
Thank you so much!
Upvotes: 2
Views: 91
Reputation: 101335
You can try the code below
z <- do.call(cbind,Reduce(`|`,rev(replace(dat,is.na(dat),0)[-1]),accumulate = TRUE))
x <- max.col(z>0,"first")
dat$closure.year <- as.numeric(gsub(".*_","",names(dat[-1])[ncol(dat)-replace(x,x == 1,NA)]))
which gives
> dat
Company rev_2001 rev_2002 rev_2003 rev_2004 rev_2005 closure.year
1 a NA 10 20 NA NA 2003
2 b 20 50 NA 60 30 NA
3 c 10 20 0 0 NA 2002
4 d NA 30 NA 50 0 2004
5 e NA NA NA NA NA NA
6 f 10 0 30 50 60 NA
Upvotes: 1
Reputation: 153
How about converting from wide to long format?
df_long <- gather(df,year,value,rev_2001:rev_2005,factor_key=TRUE)
df_long %>% group_by(Company) %>% top_n(1, value)
You'll need to add some more logic to handle your 0 and NA conditions, but this approach may help.
Upvotes: 0