dbo
dbo

Reputation: 1234

remove rows of dates and duplicate dates for any date that has an NA

I'm looking to remove all dates that have any NAs/missing data for any observation type, from a data.frame that has duplicate dates. For example, here I'd like to end up with a data.frame of just three rows with 2018-12-05 data. My non-example data.frame is quite long, so I'm leaning toward run time over readability, if there are a few ways, and I'm open to tidyverse and non-tidyverse ways.

date_time <- rep(seq(from=as.POSIXct("2018-12-01", tz="GMT"), 
             to=as.POSIXct("2018-12-05", tz="GMT"), by="1 day"),3)
value     <- c(1,2,NA,NA,5,NA,NA,NA,4,5,7,NA,NA,NA,8)
class     <- c(rep("a", 5), rep("b", 5), rep("c", 5))
df        <- data.frame(date_time, value, class)

from:

enter image description here

filtered to:

enter image description here

Upvotes: 0

Views: 226

Answers (2)

Rui Barradas
Rui Barradas

Reputation: 76402

If you need speed, I don't know. But the following function keeps only the rows where the dates have all values of class.

fun <- function(DF){
  DF2 <- DF[!is.na(DF$value), ]
  u <- unique(DF2$class)
  sp <- split(DF2, DF2$date_time)
  inx <- sapply(sp, function(d){
    all(u %in% d$class)
  })
  DF2 <- do.call(rbind, sp[inx])
  row.names(DF2) <- NULL
  DF2
}

fun(df)
#   date_time value class
#1 2018-12-05     5     a
#2 2018-12-05     5     b
#3 2018-12-05     8     c

Edit.
Here is a comparative speed test. Camille's answer is faster for larger dataframes, where speed is more important. And is nicer.

library(microbenchmark)
library(ggplot2)
library(dplyr)

fun2 <- function(DF){
  DF %>%
    arrange(date_time, class) %>%
    group_by(date_time) %>%
    mutate(all_vals = all(!is.na(value))) %>%
    filter(all_vals)
}

mb <- microbenchmark(
  rui = fun(df),
  camille = fun2(df)
)
mb1 <- microbenchmark(
  rui = fun(df1),
  camille = fun2(df1)
)

ap <- autoplot(mb)
ap1 <- autoplot(mb1)
cowplot::plot_grid(ap, ap1)

enter image description here

Upvotes: 2

camille
camille

Reputation: 16832

Within some dplyr functions, you can use the base all. Group by date, then find whether all values for each group are non-NA.

With some extra steps to illustrate:

library(dplyr)

df %>%
  arrange(date_time, class) %>%
  group_by(date_time) %>%
  mutate(all_vals = all(!is.na(value))) %>%
  filter(all_vals)
#> # A tibble: 3 x 4
#> # Groups:   date_time [1]
#>   date_time           value class all_vals
#>   <dttm>              <dbl> <fct> <lgl>   
#> 1 2018-12-05 00:00:00     5 a     TRUE    
#> 2 2018-12-05 00:00:00     5 b     TRUE    
#> 3 2018-12-05 00:00:00     8 c     TRUE

Or more briefly, find the non-NAs within filter:

df %>%
  group_by(date_time) %>%
  filter(all(!is.na(value)))
#> # A tibble: 3 x 3
#> # Groups:   date_time [1]
#>   date_time           value class
#>   <dttm>              <dbl> <fct>
#> 1 2018-12-05 00:00:00     5 a    
#> 2 2018-12-05 00:00:00     5 b    
#> 3 2018-12-05 00:00:00     8 c

Created on 2018-12-01 by the reprex package (v0.2.1)

Upvotes: 1

Related Questions