Reputation: 147
I have a large dataset with duplicated values in the first column, like so:
ID date var1 var2
person1 052016 509 1678
person2 122016 301 NA
person1 072016 NA 45
I want to combine the IDs and to take the most recent value by "date", and if it`s NA - to take the last value that it's not NA. The output should be like this:
ID date var1 var2
person2 122016 301 NA
person1 072016 509 45
I have tried with this, but it didn't worked.
library(dplyr)
data %>% group_by(ID) %>% summarise_all(funs(max(data$date))) %>% funs(first(.[!is.na(.)]))
What should I use to apply a working code to the whole dataset?
Upvotes: 1
Views: 157
Reputation: 39154
A solution using dplyr.
library(dplyr)
dat2 <- dat %>%
arrange(ID, desc(date)) %>%
group_by(ID) %>%
summarise_all(funs(first(.[!is.na(.)]))) %>%
ungroup()
dat2
# # A tibble: 2 x 4
# ID date var1 var2
# <chr> <int> <int> <int>
# 1 person1 72016 509 45
# 2 person2 122016 301 NA
DATA
dat <- read.table(text = "ID date var1 var2
person1 '052016' 509 1678
person2 '122016' 301 NA
person1 '072016' NA 45",
header = TRUE, stringsAsFactors = FALSE)
Upvotes: 1
Reputation: 1498
Using tidyverse
and fill
function.
Load data:
Mar_df <- structure(list(ID = structure(c(1L, 2L, 1L), .Label = c("person1",
"person2"), class = "factor"), date = c(52016L, 122016L, 72016L
), var1 = c(509L, 301L, NA), var2 = c(1678L, NA, 45L)), .Names = c("ID",
"date", "var1", "var2"), class = "data.frame", row.names = c(NA,
-3L))
Then:
Mar_df_summarised <- Mar_df %>%
arrange(ID,date) %>%
fill(...=var1,.direction="down") %>%
group_by(ID) %>%
summarise_all(.funs=funs(last(.)))
The result is:
# A tibble: 2 x 4
ID date var1 var2
<fctr> <int> <int> <int>
1 person1 72016 509 45
2 person2 122016 301 NA
Upvotes: 1