Reputation: 2304
I have the following sample dataset:
library(tidyverse)
dataset <- data.frame(id = c("A","A","B","B","C","A","C","B"),
value = c(100, 500, 200, 100, 500, 300, 400, 100),
status = c(NA, "Valid", NA, NA, "Pend", NA, NA, NA),
stringsAsFactors = FALSE)
What I need is to extract this unique ids with the highest value and have how much it repeats and the first non-NA status.
I have solved it in this way:
dataset_count <- dataset %>% group_by(id) %>%
summarise(count = n(), comment = max(status, na.rm = TRUE)) %>% ungroup()
dataset_cross <- dataset %>% arrange(desc(value)) %>%
left_join(dataset_count) %>% distinct(id, .keep_all = TRUE)
but since my original dataset has 120 variables and more rules to follow I would like to know if there is a way to make it more compact. For example I read about coalesce, but it doesn't allow me to extract the first NA in a grouped data. Please, could you give some advice? Thank you.
Upvotes: 0
Views: 58
Reputation: 102379
Here is a base R solution
dfout <- do.call(rbind,
c(make.row.names = F,
lapply(split(dataset,dataset$id),
function(v) {
data.frame(
id = unique(v["id"]),
value = max(v["value"]),
count = nrow(v),
status = v$status[which.max(!is.na(v$status))]
)
})))
such that
> dfout
id value count status
1 A 500 3 Valid
2 B 200 3 <NA>
3 C 500 2 Pend
Upvotes: 1
Reputation: 389175
You could get max value using max
, count number of rows using n()
and first non-NA value with which.max
for each id
.
library(dplyr)
dataset %>%
group_by(id) %>%
summarise(value = max(value),
count = n(),
status = status[which.max(!is.na(status))])
# id value count status
# <chr> <dbl> <int> <chr>
#1 A 500 3 Valid
#2 B 200 3 NA
#3 C 500 2 Pend
Upvotes: 2