Iftikhar
Iftikhar

Reputation: 667

How to remove a row which contain only missing values in R?

I have a large data set with 11 columns and 100000 rows (for example) in which i have values 1,2,3,4. Where 4 is a missing value. Some of the rows are completely missing. i.e. 4 in all 11 columns. For example

"4"  "4"  "4"  "4"  "4"  "4"  "4"  "4"  "4"  "4"   "4"

Now what i need is to remove only those rows which are completely missing. In simple words, i want to keep rows with missing value less than 11. I have used na.omit, but it does not work in my case.

Thanks in advance.

Upvotes: 5

Views: 11121

Answers (6)

mnel
mnel

Reputation: 115515

Using data.table for memory efficiency. The solution creating is.na(x) is creating a data set as large as the original, and thus is inefficient.

library(data.table)
DT <- as.data.table(m)
missing_value <- 4
missing <- as.data.table(setNames(as.list(rep(4, length(DT)), names(DT))
setkeyv(DT, names(DT))
setkey(missing, names(DT))

DT[-DT[(missing),which=T]]

both this and @JoshuaUlrich's solution are fast for large data

set.seed(21)
m <- matrix(sample(1:6, 1100000, replace=TRUE), ncol=11)
missVal <- 4
missing_rows <- sample(100000, 53)
m[missing_rows, ] <- rep(missVal, 11)

DT <- as.data.table(m)
setkeyv(DT, names(DT))
missing <- setNames(as.list(rep(missVal, 11)), names(DT))

system.time({DT1 <- DT[-DT[missing,which=T]]})
## user  system elapsed 
## 0.02    0.00    0.01 
system.time({m1 <- m[ rowSums((m==missVal)) != NCOL(m), ]})
## user  system elapsed 
## 0.02    0.02    0.03 

Upvotes: 2

Mark Miller
Mark Miller

Reputation: 13123

I found this solution elsewhere and am pasting it here using Andrie's code to generate the initial data set.

First generate the data set:

set.seed(123)
m <- matrix(sample(1:4, 30, prob=c(0.3, 0.3, 0.3, 0.1), replace=TRUE), ncol=6)
m[4, ] <- rep(4, 6)
m[m==4] <- NA
m

Here is the intial data set:

1    1    NA   2    2    2
2    3    3    1    2    3
3    2    2    1    2    3
NA   NA   NA   NA   NA   NA
NA   3    1    NA   2    1

Now remove rows that only contain missing observations:

m[rowSums(is.na(m))<ncol(m),] 

Here is the result:

1    1    NA   2    2    2
2    3    3    1    2    3
3    2    2    1    2    3
NA   3    1    NA   2    1

Upvotes: 3

Joshua Ulrich
Joshua Ulrich

Reputation: 176738

This is the fastest solution I can think of. I'll use example data similar to @Andrie.

set.seed(21)
m <- matrix(sample(1:6, 110, replace=TRUE), ncol=11)
missVal <- 4
m[4, ] <- rep(missVal, 11)
m <- m[ rowSums((m==missVal)) != NCOL(m), ]

The last line works because m==missVal returns a matrix of logical (TRUE/FALSE) values. rowSums converts TRUE to 1 and FALSE to 0, so in this case we know all the columns are 4 whenever rowSums returns 11.

Upvotes: 2

Andrie
Andrie

Reputation: 179588

Perhaps your best option is to utilise R's idiom for working with missing, or NA values. Once you have coded NA values you can work with complete.cases to easily achieve your objective.

Create some sample data with missing values (i.e. with value 4):

set.seed(123)
m <- matrix(sample(1:4, 30, prob=c(0.3, 0.3, 0.3, 0.1), replace=TRUE), ncol=6)
m[4, ] <- rep(4, 6)

Replace all values equal to 4 with NA:

m[m==4] <- NA
m
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    1   NA    2    2    2
[2,]    2    3    3    1    2    3
[3,]    3    2    2    1    2    3
[4,]   NA   NA   NA   NA   NA   NA
[5,]   NA    3    1   NA    2    1

Now you can use a variety of functions that deal with NA values. For example, complete.cases will return only, you guessed it, complete cases:

m[complete.cases(m), ]

     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    2    3    3    1    2    3
[2,]    3    2    2    1    2    3

For more information, see ?complete.cases or ?na.fail in the stats package.

Upvotes: 11

John
John

Reputation: 23768

A real speedy way would be to use a little bit of math. Assuming your dataframe is called datf

rsum <- rowSums(datf)
datf <- datf[rowSums != 44,] #11 * 4

(works for a matrix too)

Upvotes: 2

Nick Sabbe
Nick Sabbe

Reputation: 11956

Something like this should do the trick (and should work for both matrices and data.frames):

ac<-matrix(c("4","4","4","4","4","4","4","3","3","4","4", rep("4", 11)), nrow=2, ncol=11, byrow=TRUE)

rowsToRemove<-which(apply(ac, 1, function(currow){
    all(currow=="4")
}))

Now you can simply do

newac<-ac[-rowsToRemove,]

Upvotes: 1

Related Questions