Reputation: 4155
Is it possible to filter a data.frame for complete cases using dplyr? complete.cases with a list of all variables works, of course. But that is a) verbose when there are a lot of variables and b) impossible when the variable names are not known (e.g. in a function that processes any data.frame).
library(dplyr)
df = data.frame(
x1 = c(1,2,3,NA),
x2 = c(1,2,NA,5)
)
df %>%
filter(complete.cases(x1,x2))
Upvotes: 130
Views: 84061
Reputation: 18732
dplyr >= 1.0.4
if_any and if_all are available in newer versions of dplyr to apply across-like syntax in the filter function. This could be useful if you had other variables in your data frame that were not part of what you considered a complete case. For example, if you only wanted non-missing rows in columns that start with "x":
library(dplyr)
df = data.frame(
x1 = c(1,2,3,NA),
x2 = c(1,2,NA,5),
y = c(NA, "A", "B", "C")
)
df %>%
dplyr::filter(if_all(starts_with("x"), ~!is.na(.)))
  x1 x2    y
1  1  1 <NA>
2  2  2    A
For more information on these functions, see the dplyr documentation for if_any() and if_all().
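For the case raised in the question, where the column names are not known in advance, the same idea can be sketched with everything() as the selector. The helper name keep_complete_rows is made up for illustration and is not part of dplyr; the sketch assumes dplyr >= 1.0.4.

```r
library(dplyr)

# Hypothetical helper: keep rows with no NA in any column,
# without having to name the columns.
keep_complete_rows <- function(data) {
  data %>% filter(if_all(everything(), ~ !is.na(.x)))
}

df <- data.frame(
  x1 = c(1, 2, 3, NA),
  x2 = c(1, 2, NA, 5),
  y  = c(NA, "A", "B", "C")
)

# Only the second row has no missing value in any column
complete_rows <- keep_complete_rows(df)

# if_any() is the counterpart: rows where at least one "x" column is missing
rows_with_missing_x <- df %>% filter(if_any(starts_with("x"), is.na))
```

Using if_all() with a selector keeps the choice of columns explicit, so the same function body works whether "complete" means every column or only a subset.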
Upvotes: 1
Reputation: 270268
Try this:
df %>% na.omit
or this:
df %>% filter(complete.cases(.))
or this:
library(tidyr)
df %>% drop_na
If you want to filter based on one variable's missingness, use a conditional:
df %>% filter(!is.na(x1))
or
df %>% drop_na(x1)
Other answers indicate that, of the solutions above, na.omit is much slower, but that has to be balanced against the fact that it returns row indices of the omitted rows in the na.action attribute, whereas the other solutions above do not.
str(df %>% na.omit)
## 'data.frame': 2 obs. of 2 variables:
## $ x1: num 1 2
## $ x2: num 1 2
## - attr(*, "na.action")= 'omit' Named int 3 4
## ..- attr(*, "names")= chr "3" "4"
ADDED: Updated to reflect the latest versions of dplyr and tidyr, and the comments.
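As a small sketch of why the na.action attribute can be worth the extra time, the dropped row positions can be read back and used to recover the removed rows. The variable names (cleaned, dropped_idx, removed_rows) are illustrative only.

```r
df <- data.frame(
  x1 = c(1, 2, 3, NA),
  x2 = c(1, 2, NA, 5)
)

cleaned <- na.omit(df)

# Positions of the dropped rows, stored by na.omit() on its result
dropped <- attr(cleaned, "na.action")
dropped_idx <- as.integer(dropped)

# Look up the rows that were removed in the original data
removed_rows <- df[dropped_idx, ]
```

This only works with na.omit; the filter and drop_na variants return the kept rows without recording which ones were dropped.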
Upvotes: 238
Reputation: 4155
Here are some benchmark results for Grothendieck's reply. na.omit() takes 20x as much time as the other two solutions. I think it would be nice if dplyr had a function for this maybe as part of filter.
library('rbenchmark')
library('dplyr')
n = 5e6
n.na = 100000
df = data.frame(
x1 = sample(1:10, n, replace=TRUE),
x2 = sample(1:10, n, replace=TRUE)
)
df$x1[sample(1:n, n.na)] = NA
df$x2[sample(1:n, n.na)] = NA
benchmark(
df %>% filter(complete.cases(x1,x2)),
df %>% na.omit(),
df %>% (function(x) filter(x, complete.cases(x)))()
, replications=50)
# test replications elapsed relative
# 3 df %>% (function(x) filter(x, complete.cases(x)))() 50 5.422 1.000
# 1 df %>% filter(complete.cases(x1, x2)) 50 6.262 1.155
# 2 df %>% na.omit() 50 109.618 20.217
Upvotes: 18
Reputation: 2319
This is a short function which lets you specify columns (basically everything which dplyr::select can understand) which should not have any NA values (modeled after pandas df.dropna()):
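drop_na <- function(data, ...){
  if (missing(...)){
    # no columns given: check every column
    f <- complete.cases(data)
  } else {
    # select() takes the column specification directly;
    # select_() and lazyeval are deprecated in current dplyr
    f <- complete.cases(select(data, ...))
  }
  filter(data, f)
}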
[drop_na is now part of tidyr: the above can be replaced by library("tidyr")]
Examples:
library("dplyr")
df <- data.frame(a=c(1,2,3,4,NA), b=c(NA,1,2,3,4), ac=c(1,2,NA,3,4))
df %>% drop_na(a,b)
df %>% drop_na(starts_with("a"))
df %>% drop_na() # drops all rows with NAs
Upvotes: 13
Reputation: 2481
Just for the sake of completeness, dplyr::filter can be avoided altogether while still composing chains, just by using magrittr::extract (an alias of [):
library(magrittr)
df = data.frame(
x1 = c(1,2,3,NA),
x2 = c(1,2,NA,5))
df %>%
extract(complete.cases(.), )
The additional bonus is speed: this is the fastest method among the filter and na.omit variants (tested using @Miha Trošt's microbenchmarks).
Upvotes: 3
Reputation: 2022
This works for me:
df %>%
filter(complete.cases(df))
Or a little more general:
library(dplyr) # 0.4
df %>% filter(complete.cases(.))
This would have the advantage that the data could have been modified in the chain before passing it to the filter.
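A minimal sketch of that difference, assuming the magrittr pipe (the `.` placeholder is not available with the native |> pipe). The column x3 is made up for illustration.

```r
library(dplyr)

df <- data.frame(
  x1 = c(1, 2, 3, NA),
  x2 = c(1, 2, NA, 5)
)

# `.` refers to the data as it arrives at filter(), so the new x3 column
# (NA in the first row only) is included in the completeness check
result <- df %>%
  mutate(x3 = c(NA, 10, 20, 30)) %>%
  filter(complete.cases(.))

# Referring to df by name checks the original columns only,
# so the NA introduced in x3 goes unnoticed
stale <- df %>%
  mutate(x3 = c(NA, 10, 20, 30)) %>%
  filter(complete.cases(df))
```

Here result keeps only the second row, while stale also keeps the first row even though its x3 value is missing.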
Another benchmark with more columns:
set.seed(123)
x <- sample(1e5,1e5*26, replace = TRUE)
x[sample(seq_along(x), 1e3)] <- NA
df <- as.data.frame(matrix(x, ncol = 26))
library(microbenchmark)
microbenchmark(
na.omit = {df %>% na.omit},
filter.anonymous = {df %>% (function(x) filter(x, complete.cases(x)))},
rowSums = {df %>% filter(rowSums(is.na(.)) == 0L)},
filter = {df %>% filter(complete.cases(.))},
times = 20L,
unit = "relative")
# Unit: relative
#              expr       min        lq    median         uq       max neval
#           na.omit 12.252048 11.248707 11.327005 11.0623422 12.823233    20
#  filter.anonymous  1.149305  1.022891  1.013779  0.9948659  4.668691    20
#           rowSums  2.281002  2.377807  2.420615  2.3467519  5.223077    20
#            filter  1.000000  1.000000  1.000000  1.0000000  1.000000    20
Upvotes: 27
Reputation: 2011
Try this:
df[complete.cases(df),] #output to console
Or even this:
df.complete <- df[complete.cases(df),] #assign to a new data.frame
The above commands check for completeness across all the columns (variables) in your data.frame.
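If only some columns should count toward completeness, the same base R approach can be restricted to them. This is a sketch; the names cols and df.complete.x are illustrative.

```r
df <- data.frame(
  x1 = c(1, 2, 3, NA),
  x2 = c(1, 2, NA, 5),
  y  = c(NA, "A", "B", "C")
)

# Columns that must be non-missing; y is allowed to contain NA
cols <- c("x1", "x2")

# drop = FALSE keeps a data.frame even if cols has a single element
df.complete.x <- df[complete.cases(df[, cols, drop = FALSE]), ]
```

Because cols is an ordinary character vector, it can be built at run time, which covers the case where the column names are not known when the code is written.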
Upvotes: 7