M.G.
M.G.

Reputation: 87

How to create a new column based on the NAs and ZEROs of multiple other columns in R?

I've got a dataset that tracks the revenue of millions of companies over many years. The data looks like the simplified version below:

dat <-   data.frame(Company = c("a","b","c","d","e","f"), rev_2001 = c(NA, 20, 10, NA, NA, 10), 
     rev_2002 = c(10, 50, 20, 30, NA, 0), rev_2003 = c(20, NA, 0, NA, NA, 30), rev_2004 = c(NA, 60, 0, 50, NA, 50), rev_2005 = c(NA, 30, NA, 0, NA, 60))

I wanted to create a variable (a new column that we could label 'closure.year') that would capture the last year that the company had a revenue different from 0 or NA. I'm struggling with two things:

  1. The fact that I want to ignore both zeros and NAs, but I don't want to convert NAs into zeros nor the other way around;
  2. As you can see, the data has some zeros and NAs not only in the last years observed, but also in some intermediate years. If the company had a revenue after a year with a 0 or NA, that year would not be considered as a closure.year. Also, if the company had not stopped receiving revenues, the variable closure.year would be NA.

In sum, I would like to have final data that would look like:

enter image description here

Thank you so much!

Upvotes: 2

Views: 91

Answers (2)

ThomasIsCoding
ThomasIsCoding

Reputation: 101335

You can try the code below

z <- do.call(cbind,Reduce(`|`,rev(replace(dat,is.na(dat),0)[-1]),accumulate = TRUE))
x <- max.col(z>0,"first")
dat$closure.year <- as.numeric(gsub(".*_","",names(dat[-1])[ncol(dat)-replace(x,x == 1,NA)]))

which gives

> dat
  Company rev_2001 rev_2002 rev_2003 rev_2004 rev_2005 closure.year
1       a       NA       10       20       NA       NA         2003
2       b       20       50       NA       60       30           NA
3       c       10       20        0        0       NA         2002
4       d       NA       30       NA       50        0         2004
5       e       NA       NA       NA       NA       NA           NA
6       f       10        0       30       50       60           NA

Upvotes: 1

Emanuel V
Emanuel V

Reputation: 153

How about converting from wide to long format?

df_long <- gather(df,year,value,rev_2001:rev_2005,factor_key=TRUE)
df_long %>% group_by(Company) %>% top_n(1, value)

You'll need to add some more logic to handle your 0 and NA conditions, but this approach may help.

Upvotes: 0

Related Questions