Reputation: 3640
I want to delete all columns or rows with more than 50% NAs in a data frame.
This is my solution:
# delete columns with more than 50% missing values
miss <- c()
for(i in 1:ncol(data)) {
  if(length(which(is.na(data[,i]))) > 0.5*nrow(data)) miss <- append(miss, i)
}
data2 <- data[, -miss]
# delete rows with more than 50% missing values
miss2 <- c()
for(i in 1:nrow(data)) {
  if(length(which(is.na(data[i,]))) > 0.5*ncol(data)) miss2 <- append(miss2, i)
}
data <- data[-miss2, ]
but I'm looking for a nicer/faster solution. I would also appreciate a dplyr solution.
Upvotes: 30
Views: 47827
Reputation: 10011
Suppose we want to keep, from the sample data NHANES, only the columns with at most 3% missing values:
library(NHANES)
library(naniar)
library(dplyr)
select_cols <- naniar::miss_var_summary(NHANES) %>%
  filter(pct_miss <= 3) %>%
  pull(variable)
names.use <- names(NHANES)[names(NHANES) %in% select_cols]
NHANES %>%
  select(all_of(names.use))
# equivalently: NHANES[, names.use]
Upvotes: 0
Reputation: 9858
A dplyr solution

For selecting columns based on a logical condition, we can use the selection helper where(), as in:
library(dplyr)
threshold <- 0.5 # for a 50% cut-off
df %>% select(where(~mean(is.na(.)) < threshold))
For filtering rows, dplyr's if_any() and if_all() will handle the cases of 100% or 0% cutoffs, as in df %>% filter(if_any(everything(), ~is.na(.x))).
For solutions with other threshold values, you can use rowMeans:
library(dplyr)
df %>% filter(rowMeans(is.na(.)) < threshold)
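A quick illustration of the two edge-case helpers on a toy data frame (the data frame here is made up for this example):

```r
library(dplyr)

# toy data: row 2 has one NA, row 3 is entirely NA
df <- tibble(a = c(1, NA, NA), b = c(4, 5, NA))

# 0% tolerance: drop every row that contains any NA
df %>% filter(!if_any(everything(), is.na))

# 100% tolerance: drop only rows that are entirely NA
df %>% filter(!if_all(everything(), is.na))
```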
Upvotes: 5
Reputation: 1159
Here is another tips ro filter df which has 50 NaNs in columns:
## Remove columns with more than 50% NA
rawdf.prep1 <- rawdf[, sapply(rawdf, function(x) sum(is.na(x)))/nrow(rawdf)*100 <= 50]
This will result a df with only NaN in columns not greater to 50%.
Upvotes: 1
Reputation: 32426
To remove columns with more than some fraction of NA values, you can use colMeans(is.na(...)) (and rowMeans(is.na(...)) for rows):
## Some sample data
set.seed(0)
dat <- matrix(1:100, 10, 10)
dat[sample(1:100, 50)] <- NA
dat <- data.frame(dat)
## Remove columns with more than 50% NA
dat[, which(colMeans(!is.na(dat)) > 0.5)]
## Remove rows with more than 50% NA
dat[which(rowMeans(!is.na(dat)) > 0.5), ]
## Remove columns and rows with more than 50% NA
dat[which(rowMeans(!is.na(dat)) > 0.5), which(colMeans(!is.na(dat)) > 0.5)]
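Since colMeans(is.na(dat)) gives the fraction of missing values per column (and rowMeans(is.na(dat)) per row), the which() calls are optional and the comparisons can be stated on the NA fraction directly; this is an equivalent restatement of the subsetting above:

```r
## Equivalent: keep rows/columns whose NA fraction is below 50%
dat[rowMeans(is.na(dat)) < 0.5, colMeans(is.na(dat)) < 0.5]
```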
Upvotes: 57
Reputation: 13309
A tidyverse solution that removes columns with at least x% NAs (50% here):
test_data <- data.frame(A = c(rep(NA, 12), 520, 233, 522),
                        B = c(rep(10, 12), 520, 233, 522))
# Remove all columns with %NA >= 50
# (can just use > 50)
test_data %>%
  purrr::discard(~ sum(is.na(.x)) / length(.x) * 100 >= 50)
Result:
B
1 10
2 10
3 10
4 10
5 10
6 10
7 10
8 10
9 10
10 10
11 10
12 10
13 520
14 233
15 522
Upvotes: 17