J. Doe
J. Doe

Reputation: 1730

Remove rows which have all NAs in certain columns

Suppose you have a dataframe with 9 columns. You want to remove cases which have all NAs in columns 5:9. It's not at all relevant if there are NAs in columns 1:4.

So far I have found functions that allow you to remove rows that have NAs in any of the columns 5:9, but I specifically need to remove only those that have all NAs in columns 5:9.

I wrote my own function to do this, but since I have 300k+ rows, it's very slow. I was wondering is there a more efficient way? This is my code:

remove.select.na<-function(x, cols){
  nrm<-vector("numeric")
  for (i in 1:nrow(x)){
    if (sum(is.na(x[i,cols]))<length(cols)){
      nrm<-c(nrm,i)
    }
    #Console output to track the progress
    cat('\r',paste0('Checking row ',i,' of ',nrow(x),' (', format(round(i/nrow(x)*100,2), nsmall = 2),'%).'))
    flush.console()
  }
  x<-x[nrm,]
  rm(nrm)
  return(x)
}

where x is the dataframe and cols is a vector containing names of the columns that should be checked for NAs.

Upvotes: 14

Views: 15069

Answers (5)

sbha
sbha

Reputation: 10422

Here are two dplyr options:

library(dplyr)
df <- data_frame(a = c(0, NA, 0, 4, NA, 0, 6), b = c(1, NA, 0, 4, NA, 0, NA), c = c(1, 0, 1, NA, NA, 0, NA))


# columns b and c would be the columns you don't want all NAs

df %>% 
  filter_at(vars(b, c), any_vars(!is.na(.)))

df %>% 
  filter_at(vars(b, c), any_vars(complete.cases(.)))

# A tibble: 5 x 3
      a     b     c
  <dbl> <dbl> <dbl>
1     0     1     1
2    NA    NA     6
3     0     6     1
4     4     4    NA
5     0     0     0

In the newer version of dplyr, use if_any

df %>% 
      filter(if_any(c(b, c), complete.cases))

Upvotes: 6

GKi
GKi

Reputation: 39647

You can use all with apply to find rows where all values are NA:

x[!apply(is.na(x[,5:9]), 1, all),]

or negate is.na and test for any:

x[apply(!is.na(x[,5:9]), 1, any),]

or using rowSums like @RHertel wher you dont need to calculate the number of selected rows:

x[rowSums(!is.na(x[,5:9])) > 0,]

Upvotes: 8

RHertel
RHertel

Reputation: 23788

This a one-liner to remove the rows with NA in all columns between 5 and 9. By combining rowSums() with is.na() it is easy to check whether all entries in these 5 columns are NA:

x <- x[rowSums(is.na(x[,5:9]))!=5,]

Upvotes: 12

Thiago Fernandes
Thiago Fernandes

Reputation: 273

lines=
   'V1   V2   V3   V4  
    A    10   20   NA   
    B    NA   NA   NA   
    C     5   20   3     
    D    15   20   4    
    E    NA   10   5'

df = read.table(textConnection(lines), header = T)

df[is.na(df)] = 'X'


attach(df)

x = subset(df, V2 == 'X'   &   V3 == 'X'   &   V4 == 'X')
df_new = df[-as.numeric(row.names(x)),]
df_new

#  V1 V2 V3 V4
#1  A 10 20  X
#3  C  5 20  3
#4  D 15 20  4
#5  E  X 10  5

detach(df)

Upvotes: 0

Luke C
Luke C

Reputation: 10291

I don't know that it's any faster than your function, but maybe you could use !any and is.na for each row of your data frame. With this example data:

set.seed(1234)
x = do.call(cbind, lapply(1:9, function(x) runif(10)))
x[sample(length(x), size = 70)] <- NA
x <- data.frame(x)

> x
     X1 X2   X3   X4   X5   X6   X7   X8  X9
1  0.11 NA   NA 0.46 0.55 0.07   NA   NA  NA
2  0.62 NA   NA   NA   NA   NA 0.04   NA  NA
3    NA NA   NA 0.30   NA   NA   NA 0.01  NA
4  0.62 NA 0.04 0.51   NA   NA   NA   NA  NA
5  0.86 NA   NA 0.18   NA   NA   NA   NA 0.2
6  0.64 NA   NA   NA   NA 0.50   NA 0.52  NA
7    NA NA   NA   NA 0.68   NA   NA   NA  NA
8    NA NA   NA   NA   NA   NA   NA   NA  NA
9    NA NA   NA   NA   NA 0.17   NA   NA  NA
10   NA NA 0.05   NA   NA   NA   NA   NA  NA

Looks like the 4th, 8th, and 10th rows should be dropped. So, you can use apply to iterate over each row to see if the condition is satisfied- any row where with any values other than NA in the 5th to 9th column will return TRUE, so you can use that as an indexer for your data frame.

keep.rows <- apply(x[, 5:9], 1, FUN = function(row){
  any(!is.na(row))
})

> x[keep.rows, ]
    X1 X2 X3   X4   X5   X6   X7   X8  X9
1 0.11 NA NA 0.46 0.55 0.07   NA   NA  NA
2 0.62 NA NA   NA   NA   NA 0.04   NA  NA
3   NA NA NA 0.30   NA   NA   NA 0.01  NA
5 0.86 NA NA 0.18   NA   NA   NA   NA 0.2
6 0.64 NA NA   NA   NA 0.50   NA 0.52  NA
7   NA NA NA   NA 0.68   NA   NA   NA  NA
9   NA NA NA   NA   NA 0.17   NA   NA  NA

Again, not sure that it's faster than your function but... maybe?

Upvotes: 0

Related Questions