Reputation: 4201
I want to keep distinct rows on a data frame, with an algorithm that chooses the last value per group (as dplyr::distinct()
does by default), but only if it's not NA
. I've seen this great answer on SO that relies on data.table
, but I can't scale it to data with more than one grouping variable.
To demonstrate the problem, I start with the minimal example that does work, then scale it up. So first, consider the following data:
library(tibble)
df_id_and_type <-
tibble::tribble(
~id, ~type,
1, "A",
1, NA,
2, "B",
3, "A",
3, NA,
3, "D",
3, NA,
4, NA,
4, "C",
5, "A",
6, NA,
6, "B",
6, NA
)
I want to get the distinct type
values per id
, by choosing the last value unless it's NA
. If the last is NA
then go up until there's non-NA
. So this answer shows us how to do it with data.table
:
library(data.table)
dt_id_and_type <- as.data.table(df_id_and_type)
dt_id_and_type$typena <- is.na(dt_id_and_type$type)
setorderv(dt_id_and_type, c("typena","id"), order = c(-1, 1))
dt_id_and_type[!duplicated(id, fromLast = TRUE), c("id", "type"), with = FALSE]
#> id type
#> 1: 1 A
#> 2: 2 B
#> 3: 3 D
#> 4: 4 C
#> 5: 5 A
#> 6: 6 B
But what to do if we have more than one grouping variable (i.e, not only id
)? In the following example I add a year
variable:
df_id_year_and_type <-
df_id_and_type %>%
add_column(year = c(2002, 2002, 2008, 2010, 2010, 2010, 2013, 2020, 2020, 2009, 2010, 2010, 2012),
.before = "type")
df_id_year_and_type
#> # A tibble: 13 x 3
#> id year type
#> <dbl> <dbl> <chr>
#> 1 1 2002 A
#> 2 1 2002 <NA>
#> 3 2 2008 B
#> 4 3 2010 A
#> 5 3 2010 <NA>
#> 6 3 2010 D
#> 7 3 2013 <NA>
#> 8 4 2020 <NA>
#> 9 4 2020 C
#> 10 5 2009 A
#> 11 6 2010 <NA>
#> 12 6 2010 B
#> 13 6 2012 <NA>
My expected output would be:
## # A tibble: 8 x 3
## id year type
## <dbl> <dbl> <chr>
## 1 1 2002 A
## 2 2 2008 B
## 3 3 2010 D
## 4 3 2013 NA # for id 3 in year 2013 there was only `NA`, so that's what we get
## 5 4 2020 C
## 6 5 2009 A
## 7 6 2010 B
## 8 6 2012 NA # same as comment above
Any idea how I could scale the solution that worked in 1-grouping-var case to the current data? The first 2 lines of code are no-brainer:
dt_id_year_and_type <- as.data.table(df_id_year_and_type)
dt_id_year_and_type$typena <- is.na(dt_id_year_and_type$type)
setorderv(dt_id_year_and_type, c("typena","id"), order = c(-1, 1)) # <--- how to account for `year`?
dt_id_year_and_type[!duplicated(id, fromLast = TRUE), c("id", "type"), with = FALSE] # <--- here too...
Upvotes: 0
Views: 1137
Reputation: 66
Why not using a simple max ?
setDT(df_id_year_and_type)
df_id_year_and_type[,max(type, na.rm=T), by=.(id, year)]
You will get a warning when there is only NA and the option na.rm is TRUE, but you can suppress it easily:
df_id_year_and_type[,suppressWarnings(max(type, na.rm=T)), by=.(id, year)]
Or alternatively, test if all values are NA:
df_id_year_and_type[,ifelse(all(is.na(type)), NA_character_, max(type, na.rm=T)), by=.(id, year)]
Upvotes: 0
Reputation: 191
library(dplyr)
A simple, easy to read example of the basic case is
df_id_and_type %>% filter(!is.na(type)) %>%
filter(id != lead(id) | id == max(id))
extending to the second criteria
df_id_year_and_type %>% filter(!is.na(type)) %>%
filter((id != lead(id) | id == max(id)) &
(year != lead(year) | year == max(year)))
It is clear and easy to understand. If you wish to retain the distinct groupings with no result you can either merge distinct back or insert another OR clause in the filters
Upvotes: 0
Reputation: 6489
Here some data.table-based solutions.
setDT(df_id_year_and_type)
na.omit(df_id_year_and_type, cols="type")
drops NA
rows based on column type
.
unique(df_id_year_and_type[, .(id, year)], fromLast=TRUE)
finds all the groups.
And by joining them (using the last match: mult="last"
), we obtain the desired output.
na.omit(df_id_year_and_type, cols="type"
)[unique(df_id_year_and_type[, .(id, year)], fromLast=TRUE),
on=c('id', 'year'),
mult="last"]
# id year type
# <num> <num> <char>
# 1: 1 2002 A
# 2: 2 2008 B
# 3: 3 2010 D
# 4: 3 2013 <NA>
# 5: 4 2020 C
# 6: 5 2009 A
# 7: 6 2010 B
# 8: 6 2012 <NA>
df_id_year_and_type[df_id_year_and_type[, .I[which.max(cumsum(!is.na(type)))], .(id, year)]$V1,]
(likely slower because of [
overhead)
df_id_year_and_type[, .SD[which.max(cumsum(!is.na(type)))], .(id, year)]
Upvotes: 2
Reputation: 25323
Another possible solution:
library(tidyverse)
df_id_year_and_type %>%
group_by(id, year) %>%
fill(type, .direction = "downup") %>%
summarise(type = last(type), .groups = "drop")
#> # A tibble: 8 × 3
#> id year type
#> <dbl> <dbl> <chr>
#> 1 1 2002 A
#> 2 2 2008 B
#> 3 3 2010 D
#> 4 3 2013 <NA>
#> 5 4 2020 C
#> 6 5 2009 A
#> 7 6 2010 B
#> 8 6 2012 <NA>
Upvotes: 1
Reputation: 6956
I would propose this solution in which you exclude the unwanted rows prior to unique
. If all observations for a group are NA
, sum(is.na(x)) / .N
is equal to 1 and we proceed from there
library(tibble)
library(data.table)
df_id_and_type <-
tibble::tribble(
~id, ~type,
1, "A",
1, NA,
2, "B",
3, "A",
3, NA,
3, "D",
3, NA,
4, NA,
4, "C",
5, "A",
6, NA,
6, "B",
6, NA
)
df_id_year_and_type <-
df_id_and_type %>%
add_column(year = c(2002, 2002, 2008, 2010, 2010, 2010, 2013, 2020, 2020, 2009, 2010, 2010, 2012),
.before = "type")
# convert to data.table
dt_id_year_and_type <- as.data.table(df_id_year_and_type)
# define grouping vars
grouping_vars <- c("id", "year")
# are all types na for a group?
dt_id_year_and_type[, na_ratio := sum(is.na(type)) / .N,
by = c(grouping_vars)]
# remove all lines that are NA, except they are from a group in which all
# observations are NA
dt_id_year_and_type <- dt_id_year_and_type[na_ratio == 1 | !is.na(type)]
# sort correctly
setorderv(dt_id_year_and_type, grouping_vars)
dt_id_year_and_type
#> id year type na_ratio
#> 1: 1 2002 A 0.5000000
#> 2: 2 2008 B 0.0000000
#> 3: 3 2010 A 0.3333333
#> 4: 3 2010 D 0.3333333
#> 5: 3 2013 <NA> 1.0000000
#> 6: 4 2020 C 0.5000000
#> 7: 5 2009 A 0.0000000
#> 8: 6 2010 B 0.5000000
#> 9: 6 2012 <NA> 1.0000000
# keep only the last observation of each group
dt_unique <- unique(dt_id_year_and_type, by = grouping_vars, fromLast = TRUE)
remove no longer needed helper column
dt_unique[, na_ratio := NULL]
dt_unique
#> id year type
#> 1: 1 2002 A
#> 2: 2 2008 B
#> 3: 3 2010 D
#> 4: 3 2013 <NA>
#> 5: 4 2020 C
#> 6: 5 2009 A
#> 7: 6 2010 B
#> 8: 6 2012 <NA>
Upvotes: 1