Vineet

Reputation: 1572

Getting the wrong result when removing all-NA columns in R

I am getting the wrong result when removing the columns that contain only NA values in R.

Data file: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

trainingData <- read.csv("D:\\pml-training.csv",na.strings = c("NA","", "#DIV/0!"))

Now I want to remove all the columns that contain only NAs.

Approach 1: keep every column whose count of non-NA values is greater than 0

aa <- trainingData[colSums(!is.na(trainingData)) > 0]
length(colnames(aa)) 

154 columns

Approach 2: As I understand this statement, it should select the columns that are NA and sum to 0, but it actually returns the columns that contain no NAs, which is the result I expected

bb <- trainingData[,colSums(is.na(trainingData)) == 0]
length(colnames(bb)) 

60 columns (expected)

Can someone please help me understand what is wrong with the first statement and what is right with the second one?

Upvotes: 1

Views: 224

Answers (1)

Florian

Reputation: 25375

aa <- trainingData[,colSums(!is.na(trainingData)) > 0]
length(colnames(aa)) 

You convert the data frame to a logical matrix with !is.na(trainingData) and select all columns that contain at least one TRUE (i.e. at least one non-NA value), since the condition is > 0. So this returns every column that has at least one non-NA value, which seems to be all but 6 columns.


bb <- trainingData[colSums(is.na(trainingData)) == 0]
length(colnames(bb)) 

You convert the data frame to a logical matrix with is.na(trainingData) and select all columns that contain no TRUE (no NA). This returns only the columns with no missing values at all (i.e. no NAs).

Example as requested in comment:

df <- data.frame(a = c(1, 2, 3), b = c(NA, 1, 1), c = c(NA, NA, NA))
bb <- df[colSums(is.na(df)) == 0]

> df
  a  b  c
1 1 NA NA
2 2  1 NA
3 3  1 NA
> bb
  a
1 1
2 2
3 3
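
For comparison, here is a minimal sketch that runs both statements on the same toy data frame and also stores the intermediate column sums. The names non_na_counts and na_counts are only illustrative, and drop = FALSE is added so that a single remaining column still comes back as a data frame.

```r
# Toy data frame from the example above:
# a has no NA, b has one NA, c is entirely NA
df <- data.frame(a = c(1, 2, 3), b = c(NA, 1, 1), c = c(NA, NA, NA))

# Number of non-NA values per column
non_na_counts <- colSums(!is.na(df))   # a = 3, b = 2, c = 0
# Number of NA values per column
na_counts <- colSums(is.na(df))        # a = 0, b = 1, c = 3

# First statement: keep columns with at least one non-NA value (drops only c)
aa <- df[, non_na_counts > 0, drop = FALSE]
# Second statement: keep columns with no NA at all (drops b and c)
bb <- df[, na_counts == 0, drop = FALSE]

names(aa)  # "a" "b"
names(bb)  # "a"
```

Column b is the interesting case: it has one NA, so the second statement drops it, but it is not all NA, so the first statement keeps it.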

So the two statements are in fact different. If you want to remove only the columns that consist entirely of NAs, you should use the first statement. Hope this helps.

Upvotes: 1
