Deleting columns with more than 50% of zeros and also columns with more than 25% of "na"

I need to delete the columns with more than 50% of zeros and also columns with more than 25% of na. I tried using clean function deleting first the na in the read.csv(...., na.string="na"). Then I used write.csv and tried to use read.csv for the new data file to use the clean function again using read.csv(...., na.string="0") but an error comes up saying

ERROR:This data set does not require cleaning.

Is there a way to use apply and execute a function for both zeros and NAs?

Sorry I am new to R.

Upvotes: 0

Views: 673

Answers (2)

moodymudskipper
moodymudskipper

Reputation: 47320

data

set.seed(1);
df <- as.data.frame(matrix(sample(c(1,1,0,NA),42,T), ncol = 6));
#   V1 V2 V3 V4 V5 V6
# 1  1  0 NA  1 NA  0
# 2  1  0  1  0  1 NA
# 3  0  1  0  1  1  1
# 4 NA  1 NA  1  0  0
# 5  1  1  1  1  1  1
# 6 NA  0 NA  1  1 NA
# 7 NA  1 NA  1 NA  0

solution

df[,colSums(df==0,na.rm = T)/nrow(df) < 0.25 & colSums(is.na(df))/nrow(df) < 0.5]

#   V2 V4 V5 V6
# 1  0  1  1  1
# 2  1  0 NA NA
# 3 NA  1  1 NA
# 4  1  1  1  1
# 5  1 NA  1  0
# 6  1  1 NA  1
# 7  1 NA NA  1

Upvotes: 1

Maurits Evers
Maurits Evers

Reputation: 50678

You can do something like this using sapply to directly return indexes of those columns that have >=50% valid (i.e. non-null and non-NA) entries. Since you don't provide any sample dataset, I am generating a sample data.frame.

# Sample data
set.seed(2017);
x <- sample(20);
x[sample(1:20, 5)] <- NA;
x[sample(1:20, 5)] <- 0;
df <- as.data.frame(matrix(x, ncol = 4));
df;
#  V1 V2 V3 V4
#1 19  0  7  2
#2  0  1  0 NA
#3  9 NA NA 15
#4 NA 16 20  0
#5  0  4  3 NA

# 50% threshold
thresh <- 0.50;
df[, sapply(df, function(x) length(x[!(is.na(x) | x == 0)]) / length(x) >= 0.5)];
#  V2 V3
#1  0  7
#2  1  0
#3 NA NA
#4 16 20
#5  4  3

Explanation: x[!(is.na(x) | x == 0)] selects column entries that are non-null and non-NA; we then calculate the fraction of non-null & non-NA entries amongst all entries per column, and return indices of those columns that have a fraction >=0.5.

Upvotes: 0

Related Questions