Reputation: 8494
For an example dataframe:
df <- structure(list(region = structure(1:8, .Label = c("a", "b", "c",
"d", "e", "f", "g", "h"), class = "factor"), y.2012 = c(5.5,
NA, 4.7, 3.6, NA, NA, 4.6, NA), y.2013 = c(5.7, NA, NA, 3.8,
NA, 6.2, NA, NA), y.2014 = c(NA, 6.3, NA, 4.1, 5.1, NA, NA, NA
)), .Names = c("region", "y.2012", "y.2013", "y.2014"), class = "data.frame", row.names = c(NA,
-8L))
I want to add an extra column (called 'variable') that is populated by the most recent year of data that is available. For example, 2014 is preferable, followed by 2013, and then 2012. If no data is available, then NA can be left.
I want to the resulting dataframe to look like:
region y.2012 y.2013 y.2014 variable
1 a 5.5 5.7 NA 5.7
2 b NA NA 6.3 6.3
3 c 4.7 NA NA 4.7
4 d 3.6 3.8 4.1 4.1
5 e NA NA 5.1 5.1
6 f NA 6.2 NA 6.2
7 g 4.6 NA NA 4.6
8 h NA NA NA NA
What is the best way to tackle my problem? Any advice would be most appreciated.
Upvotes: 2
Views: 85
Reputation: 2415
Another way is to use dplyr
library which has the coalesce
function (in the development version!). Here we can specify the particular order of columns you want (which may be the case if you have lots of variables in this table, or you have a particular order.
> library(dplyr)
> df %>% mutate(variable = coalesce(y.2014, y.2013, y.2012)) # specify order of columns here
region y.2012 y.2013 y.2014 variable
1 a 5.5 5.7 NA 5.7
2 b NA NA 6.3 6.3
3 c 4.7 NA NA 4.7
4 d 3.6 3.8 4.1 4.1
5 e NA NA 5.1 5.1
6 f NA 6.2 NA 6.2
7 g 4.6 NA NA 4.6
8 h NA NA NA NA
Upvotes: 1
Reputation: 92292
From what it looks like, each year the values increase. Hence, an efficient/straightforward solution could be to use pmax
do.call(pmax, c(na.rm = TRUE, df[-1]))
# [1] 5.7 6.3 4.7 4.1 5.1 6.2 4.6 NA
Upvotes: 1
Reputation: 1344
If you know that your data is just for three years, you could just use the more intuitive:
selectYear <- function(y12,y13,y14) {
selectedYear <- ifelse(is.na(y14),
ifelse(is.na(y13),
y12,
y13),
y14)
return(selectedYear)
}
df$answer <- selectYear(df$y.2012,df$y.2013,df$y.2014)
With more years, though, this method would get a lot more complicated, and akrun's answer would work better.
Upvotes: 0
Reputation: 887118
We can use apply
with MARGIN=1
to loop over the rows and get the last non-NA value (as the columns are ordered from smallest to largest years)
df$variable <- apply(df[-1], 1, function(x) {
i1 <- tail(x[!is.na(x)],1)
if(length(i1)>0) i1 else NA})
df$variable
#[1] 5.7 6.3 4.7 4.1 5.1 6.2 4.6 NA
Or another option is with max.col
and rowSums
df[-1][cbind(1:nrow(df),max.col(!is.na(df[-1]), "last")*
(NA^!rowSums(!is.na(df[-1]))))]
#[1] 5.7 6.3 4.7 4.1 5.1 6.2 4.6 NA
Upvotes: 2