spore234

Reputation: 3640

Delete columns/rows with more than x% missing

I want to delete all columns or rows with more than 50% NAs in a data frame.

This is my solution:

# delete columns with more than 50% missings
miss <- c()
for(i in 1:ncol(data)) {
  if(length(which(is.na(data[,i]))) > 0.5*nrow(data)) miss <- append(miss,i) 
}
data2 <- data[,-miss]


# delete rows with more than 50% missing
miss2 <- c()
for(i in 1:nrow(data)) {
  if(length(which(is.na(data[i,]))) > 0.5*ncol(data)) miss2 <- append(miss2,i) 
}
data <- data[-miss2,]

but I'm looking for a nicer/faster solution.

I would also appreciate a dplyr solution.

Upvotes: 30

Views: 47827

Answers (5)

ah bon

Reputation: 10011

Suppose we want to keep, from the sample data NHANES, only the columns whose share of missing values is less than or equal to 3%:

library(NHANES)
library(naniar)
library(dplyr)

select_cols <- naniar::miss_var_summary(NHANES) %>% 
  filter(pct_miss <= 3) %>% 
  pull(variable)
names.use <- names(NHANES)[(names(NHANES) %in% select_cols)]
NHANES %>% 
  select(c(names.use))
# NHANES[, c(names.use)]


Upvotes: 0

GuedesBF

Reputation: 9858

A dplyr solution

For selecting columns based on a logical condition, we can use the selection helper where(), as in:

library(dplyr)

threshold <- 0.5 #for a 50% cut-off

df %>% select(where(~mean(is.na(.)) < threshold))

For filtering rows, dplyr's if_any() and if_all() handle the 0% and 100% cut-off cases; for example, df %>% filter(!if_any(everything(), is.na)) keeps only complete rows. For other threshold values, you can use rowMeans:

library(dplyr)

df %>% filter(rowMeans(is.na(.)) < threshold)
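A minimal end-to-end sketch combining both steps on toy data (the data frame `df` and the 50% threshold below are made up for illustration):

```r
library(dplyr)

# Toy data: column C and row 3 are mostly NA
df <- tibble(
  A = c(1, 2, NA, 4),
  B = c(NA, 2, NA, 4),
  C = c(NA, NA, NA, 4)
)

threshold <- 0.5

# Drop columns with >= 50% NA, then rows with >= 50% NA
result <- df %>%
  select(where(~ mean(is.na(.x)) < threshold)) %>%
  filter(rowMeans(is.na(.)) < threshold)

result
# Keeps only column A (B is 50% NA, C is 75% NA)
# and rows 1, 2 and 4 (row 3 is all NA in the remaining column)
```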

Upvotes: 5

abdoulsn

Reputation: 1159

Here is another tip: filter the df, keeping only the columns with at most 50% NAs:

## Remove columns with more than 50% NA
rawdf.prep1 <- rawdf[, sapply(rawdf, function(x) sum(is.na(x)))/nrow(rawdf)*100 <= 50]

This results in a df whose columns each contain no more than 50% NAs.

Upvotes: 1

Rorschach

Reputation: 32426

To remove columns with more than a given fraction of NA, you can use colMeans(is.na(...)):

## Some sample data
set.seed(0)
dat <- matrix(1:100, 10, 10)
dat[sample(1:100, 50)] <- NA
dat <- data.frame(dat)

## Remove columns with more than 50% NA
dat[, which(colMeans(!is.na(dat)) > 0.5)]

## Remove rows with more than 50% NA
dat[which(rowMeans(!is.na(dat)) > 0.5), ]

## Remove columns and rows with more than 50% NA
dat[which(rowMeans(!is.na(dat)) > 0.5), which(colMeans(!is.na(dat)) > 0.5)]

Upvotes: 57

NelsonGon

Reputation: 13309

A tidyverse solution that removes columns with more than x% NAs (50% here):

test_data <- data.frame(A=c(rep(NA,12),
                            520,233,522),
                        B = c(rep(10,12),
                              520,233,522))
# Remove all columns with %NA >= 50
# (use > 50 instead to keep columns with exactly 50% NA)

test_data %>% 
  purrr::discard(~sum(is.na(.x))/length(.x)*100 >= 50)

Result:

     B
1   10
2   10
3   10
4   10
5   10
6   10
7   10
8   10
9   10
10  10
11  10
12  10
13 520
14 233
15 522

Upvotes: 17
