Reputation: 1
I need to delete the columns with more than 50% of zeros and also columns with more than 25% of na. I tried using clean function deleting first the na in the read.csv(...., na.string="na")
. Then I used write.csv
and tried to use read.csv
for the new data file to use the clean function again using read.csv(...., na.string="0")
but an error comes up saying
ERROR:This data set does not require cleaning.
Is there a way to use apply and execute a function for both zeros and NA
s?
Sorry I am new to R.
Upvotes: 0
Views: 673
Reputation: 47320
data
set.seed(1);
df <- as.data.frame(matrix(sample(c(1,1,0,NA),42,T), ncol = 6));
# V1 V2 V3 V4 V5 V6
# 1 1 0 NA 1 NA 0
# 2 1 0 1 0 1 NA
# 3 0 1 0 1 1 1
# 4 NA 1 NA 1 0 0
# 5 1 1 1 1 1 1
# 6 NA 0 NA 1 1 NA
# 7 NA 1 NA 1 NA 0
solution
df[,colSums(df==0,na.rm = T)/nrow(df) < 0.25 & colSums(is.na(df))/nrow(df) < 0.5]
# V2 V4 V5 V6
# 1 0 1 1 1
# 2 1 0 NA NA
# 3 NA 1 1 NA
# 4 1 1 1 1
# 5 1 NA 1 0
# 6 1 1 NA 1
# 7 1 NA NA 1
Upvotes: 1
Reputation: 50678
You can do something like this using sapply
to directly return indexes of those columns that have >=50%
valid (i.e. non-null and non-NA
) entries. Since you don't provide any sample dataset, I am generating a sample data.frame
.
# Sample data
set.seed(2017);
x <- sample(20);
x[sample(1:20, 5)] <- NA;
x[sample(1:20, 5)] <- 0;
df <- as.data.frame(matrix(x, ncol = 4));
df;
# V1 V2 V3 V4
#1 19 0 7 2
#2 0 1 0 NA
#3 9 NA NA 15
#4 NA 16 20 0
#5 0 4 3 NA
# 50% threshold
thresh <- 0.50;
df[, sapply(df, function(x) length(x[!(is.na(x) | x == 0)]) / length(x) >= 0.5)];
# V2 V3
#1 0 7
#2 1 0
#3 NA NA
#4 16 20
#5 4 3
Explanation: x[!(is.na(x) | x == 0)]
selects column entries that are non-null and non-NA
; we then calculate the fraction of non-null & non-NA
entries amongst all entries per column, and return indices of those columns that have a fraction >=0.5
.
Upvotes: 0