KT_1
KT_1

Reputation: 8494

Populate column which a range of possible values in other columns: in R

For an example dataframe:

df <- structure(list(region = structure(1:8, .Label = c("a", "b", "c", 
"d", "e", "f", "g", "h"), class = "factor"), y.2012 = c(5.5, 
NA, 4.7, 3.6, NA, NA, 4.6, NA), y.2013 = c(5.7, NA, NA, 3.8, 
NA, 6.2, NA, NA), y.2014 = c(NA, 6.3, NA, 4.1, 5.1, NA, NA, NA
)), .Names = c("region", "y.2012", "y.2013", "y.2014"), class = "data.frame", row.names = c(NA, 
-8L))

I want to add an extra column (called 'variable') that is populated by the most recent year of data that is available. For example, 2014 is preferable, followed by 2013, and then 2012. If no data is available, then NA can be left.

I want to the resulting dataframe to look like:

  region y.2012 y.2013 y.2014 variable
1      a    5.5    5.7     NA      5.7
2      b     NA     NA    6.3      6.3
3      c    4.7     NA     NA      4.7
4      d    3.6    3.8    4.1      4.1
5      e     NA     NA    5.1      5.1
6      f     NA    6.2     NA      6.2
7      g    4.6     NA     NA      4.6
8      h     NA     NA     NA       NA

What is the best way to tackle my problem? Any advice would be most appreciated.

Upvotes: 2

Views: 85

Answers (4)

chappers
chappers

Reputation: 2415

Another way is to use dplyr library which has the coalesce function (in the development version!). Here we can specify the particular order of columns you want (which may be the case if you have lots of variables in this table, or you have a particular order.

> library(dplyr)
> df %>% mutate(variable = coalesce(y.2014, y.2013, y.2012)) # specify order of columns here
  region y.2012 y.2013 y.2014 variable
1      a    5.5    5.7     NA      5.7
2      b     NA     NA    6.3      6.3
3      c    4.7     NA     NA      4.7
4      d    3.6    3.8    4.1      4.1
5      e     NA     NA    5.1      5.1
6      f     NA    6.2     NA      6.2
7      g    4.6     NA     NA      4.6
8      h     NA     NA     NA       NA

Upvotes: 1

David Arenburg
David Arenburg

Reputation: 92292

From what it looks like, each year the values increase. Hence, an efficient/straightforward solution could be to use pmax

do.call(pmax, c(na.rm = TRUE, df[-1]))
# [1] 5.7 6.3 4.7 4.1 5.1 6.2 4.6  NA

Upvotes: 1

shrgm
shrgm

Reputation: 1344

If you know that your data is just for three years, you could just use the more intuitive:

selectYear <- function(y12,y13,y14) {
    selectedYear <- ifelse(is.na(y14),
                           ifelse(is.na(y13),
                                  y12,
                                  y13),
                           y14)
    return(selectedYear)
}

df$answer <- selectYear(df$y.2012,df$y.2013,df$y.2014)

With more years, though, this method would get a lot more complicated, and akrun's answer would work better.

Upvotes: 0

akrun
akrun

Reputation: 887118

We can use apply with MARGIN=1 to loop over the rows and get the last non-NA value (as the columns are ordered from smallest to largest years)

df$variable <- apply(df[-1], 1, function(x) {
                       i1 <- tail(x[!is.na(x)],1)
                     if(length(i1)>0) i1 else NA})
df$variable
#[1] 5.7 6.3 4.7 4.1 5.1 6.2 4.6  NA

Or another option is with max.col and rowSums

df[-1][cbind(1:nrow(df),max.col(!is.na(df[-1]), "last")*
             (NA^!rowSums(!is.na(df[-1]))))]
#[1] 5.7 6.3 4.7 4.1 5.1 6.2 4.6  NA

Upvotes: 2

Related Questions