Reputation: 411
I have a data frame that looks like the following:
Year Day ID V1 V2 ....
2003 35 1102 3 6
2003 35 1103 5 NA
2003 35 1104 8 100
.....
2003 40 1102 NA 8
2003 40 1103 NA 10
2003 40 1104 9 NA
.....
.....
2018 49 1104 5 NA
.....
2018 50 1102 3 6
2018 50 1103 7 NA
2018 50 1104 NA 100
I would like to build a data frame that extracts, for each combination of Year and ID, the latest (highest value in the Day column) non-NA value in V1, V2, ... Based on the above data set, for Year = 2018 and ID = 1104, I would like to extract V1 = 5 (on Day = 49) and V2 = 100 (on Day = 50). If all values for that Year and ID combination are NA, then I would like it to return NA.
Upvotes: 1
Views: 629
Reputation: 388817
We can create a function which gives us the latest non-NA value, based on Day, for each Vn column
get_last_non_NA_value <- function(x) {
  x[which.max(cumsum(!is.na(x)))]
}
and then apply that function for each Year and ID
library(dplyr)
df %>%
  group_by(Year, ID) %>%
  summarise_at(vars(V1:V2), funs(get_last_non_NA_value(.[order(Day)])))
# Year ID V1 V2
# <int> <int> <int> <int>
#1 2003 1102 3 8
#2 2003 1103 5 10
#3 2003 1104 9 100
#4 2018 1102 3 6
#5 2018 1103 7 NA
#6 2018 1104 5 100
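To see why the helper picks the right element, and how the same summary might be written without `funs()`, here is a minimal sketch. It assumes dplyr 1.0.0 or later (for `across()` and `.groups`); the toy vector `x` and the name `result` are illustrative only, and `df` is the sample data from this answer rebuilt with `data.frame()`.

```r
library(dplyr)

get_last_non_NA_value <- function(x) {
  x[which.max(cumsum(!is.na(x)))]
}

# Toy vector (illustrative): the running count of non-NA entries stops
# growing after the last non-NA value, and which.max() returns the FIRST
# position of the maximum, i.e. the position of that last non-NA value.
x <- c(3, NA, 8, NA)
running_count <- cumsum(!is.na(x))          # 1 1 2 2
idx <- which.max(running_count)             # 3
all_na <- get_last_non_NA_value(c(NA, NA))  # count stays 0, so x[1] (NA) is returned

df <- data.frame(
  Year = c(2003L, 2003L, 2003L, 2003L, 2003L, 2003L, 2018L, 2018L, 2018L, 2018L),
  Day  = c(35L, 35L, 35L, 40L, 40L, 40L, 49L, 50L, 50L, 50L),
  ID   = c(1102L, 1103L, 1104L, 1102L, 1103L, 1104L, 1104L, 1102L, 1103L, 1104L),
  V1   = c(3L, 5L, 8L, NA, NA, 9L, 5L, 3L, 7L, NA),
  V2   = c(6L, NA, 100L, 8L, 10L, NA, NA, 6L, NA, 100L)
)

# Assumed dplyr >= 1.0.0 equivalent of the summarise_at() call above
result <- df %>%
  arrange(Day) %>%   # sort once so "last" means "highest Day" within each group
  group_by(Year, ID) %>%
  summarise(across(V1:V2, get_last_non_NA_value), .groups = "drop")
```

Sorting with `arrange(Day)` before grouping plays the same role as `.[order(Day)]` in the original call.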
EDIT
If we also want to extract the corresponding Day for each value, we can change the function to return both values as a comma-separated string
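get_last_non_NA_value <- function(x, y) {
  # Rank rows by y (Day), then map the position found in the sorted vector
  # back to the original row index so x and y are indexed consistently
  ord <- order(y)
  ind <- ord[which.max(cumsum(!is.na(x[ord])))]
  paste(x[ind], y[ind], sep = ",")
}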
and then use cSplit to separate these comma-separated values into different columns.
library(dplyr)
library(splitstackshape)
cols <- c("V1", "V2")
df %>%
  group_by(Year, ID) %>%
  summarise_at(cols, funs(get_last_non_NA_value(., Day))) %>%
  cSplit(cols) %>%
  rename_at(vars(contains("_1")), funs(sub("_1", "_last_value", .))) %>%
  rename_at(vars(contains("_2")), funs(sub("_2", "_days", .)))
# Year ID V1_last_value V1_days V2_last_value V2_days
#1: 2003 1102 3 35 8 40
#2: 2003 1103 5 35 10 40
#3: 2003 1104 9 40 100 35
#4: 2018 1102 3 50 6 50
#5: 2018 1103 7 50 NA 50
#6: 2018 1104 5 49 100 50
Note that the rename_at part only renames the columns to make it clearer what each one holds; you can skip it if you are not interested in renaming columns.
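As a possible alternative to pasting and splitting strings, the value and its Day can be returned as separate columns directly, which keeps them as integers. This is only a sketch, assuming dplyr 1.0.0 or later; `last_non_na_index` is a hypothetical helper introduced here, not part of any package, and `df` is the sample data rebuilt with `data.frame()`.

```r
library(dplyr)

df <- data.frame(
  Year = c(2003L, 2003L, 2003L, 2003L, 2003L, 2003L, 2018L, 2018L, 2018L, 2018L),
  Day  = c(35L, 35L, 35L, 40L, 40L, 40L, 49L, 50L, 50L, 50L),
  ID   = c(1102L, 1103L, 1104L, 1102L, 1103L, 1104L, 1104L, 1102L, 1103L, 1104L),
  V1   = c(3L, 5L, 8L, NA, NA, 9L, 5L, 3L, 7L, NA),
  V2   = c(6L, NA, 100L, 8L, 10L, NA, NA, 6L, NA, 100L)
)

# Hypothetical helper: row position of the last non-NA entry of x when rows
# are ranked by day (falls back to the row with the smallest day if all NA)
last_non_na_index <- function(x, day) {
  ord <- order(day)
  ord[which.max(cumsum(!is.na(x[ord])))]
}

result <- df %>%
  group_by(Year, ID) %>%
  summarise(
    across(
      V1:V2,
      list(
        last_value = ~ .x[last_non_na_index(.x, Day)],
        days       = ~ Day[last_non_na_index(.x, Day)]
      )
    ),
    .groups = "drop"
  )
```

With the default naming of `across()`, the new columns come out as V1_last_value, V1_days, V2_last_value and V2_days, so no renaming step is needed.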
data
df <- structure(list(Year = c(2003L, 2003L, 2003L, 2003L, 2003L, 2003L,
2018L, 2018L, 2018L, 2018L), Day = c(35L, 35L, 35L, 40L, 40L,
40L, 49L, 50L, 50L, 50L), ID = c(1102L, 1103L, 1104L, 1102L,
1103L, 1104L, 1104L, 1102L, 1103L, 1104L), V1 = c(3L, 5L, 8L,
NA, NA, 9L, 5L, 3L, 7L, NA), V2 = c(6L, NA, 100L, 8L, 10L, NA,
NA, 6L, NA, 100L)), .Names = c("Year", "Day", "ID", "V1", "V2"
), class = "data.frame", row.names = c(NA, -10L))
Upvotes: 1
Reputation: 3183
You can use dplyr.
Assuming you want the max for V1 and V2:
library(dplyr)
df %>%
  group_by(Year, ID) %>%
  summarise(Day = max(Day, na.rm = TRUE),
            V1 = max(V1, na.rm = TRUE),
            V2 = max(V2, na.rm = TRUE))
If you instead want the first non-NA value for V1 and V2, then
df %>%
  group_by(Year, ID) %>%
  summarise(Day = max(Day, na.rm = TRUE),
            V1 = first(setdiff(V1, NA)),
            V2 = first(setdiff(V2, NA)))
Upvotes: 0