Reputation: 12054

Remove rows with all or some NAs (missing values) in data.frame

I'd like to remove the lines in this data frame that:

a) contain NAs across all columns. Below is my example data frame.

             gene hsap mmul mmus rnor cfam
1 ENSG00000208234    0   NA   NA   NA   NA
2 ENSG00000199674    0   2    2    2    2
3 ENSG00000221622    0   NA   NA   NA   NA
4 ENSG00000207604    0   NA   NA   1    2
5 ENSG00000207431    0   NA   NA   NA   NA
6 ENSG00000221312    0   1    2    3    2

Basically, I'd like to get a data frame such as the following.

             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0   2    2    2    2
6 ENSG00000221312    0   1    2    3    2

b) contain NAs in only some columns, so I can also get this result:

             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0   2    2    2    2
4 ENSG00000207604    0   NA   NA   1    2
6 ENSG00000221312    0   1    2    3    2

Upvotes: 1087

Answers (20)

Maël

Reputation: 52319

Another option is the na_omit function of collapse:

library(collapse)
na_omit(df)
#              gene hsap mmul mmus rnor cfam
# 1 ENSG00000199674    0    2    2    2    2
# 2 ENSG00000221312    0    1    2    3    2

Or, for selected columns:

na_omit(df, cols = c("rnor", "cfam")) #Alternatively, works with a function, index or logical vector

#              gene hsap mmul mmus rnor cfam
# 1 ENSG00000199674    0    2    2    2    2
# 2 ENSG00000207604    0   NA   NA    1    2
# 3 ENSG00000221312    0    1    2    3    2

collapse::na_omit is faster than any other solutions provided here, including data.table::na.omit and tidyr::drop_na:

#Using data from https://stackoverflow.com/a/48830183/13460602
mb <- 
  microbenchmark(
  collapse = na_omit(dt),
  dt = na.omit(dt), 
  base = na.omit(df),
  complete.cases = df[complete.cases(df), ],
  tidyr = drop_na(df)
)

# Unit: milliseconds
#            expr      min       lq     mean   median        uq       max neval
#        collapse  70.6927 130.5431 191.9058 156.6320  217.1957  915.6074   100
#              dt  76.5151 130.7049 195.9737 172.6981  215.2754  735.1839   100
#            base 406.3456 778.5028 900.1791 860.7407 1005.6444 2009.2036   100
#  complete.cases 295.6927 497.8572 649.3397 575.9408  715.9452 2137.3366   100
#           tidyr  83.4586 145.3758 207.3149 180.9264  242.0750  838.3654   100

Upvotes: 1

jay.sf

Reputation: 73562

For removing rows with some missings, i.e. where the columns are not known beforehand, we have @PierreL's great answer using rowSums().

I would like to quickly point out a slightly faster way using rowSums2 from the matrixStats package. For a 200,000x1,000 matrix, this saves almost a second.

The function is applied in the same way. (Note, that the result of is.na() is already of class "matrix", otherwise rowSums2 would have complained.)

M[matrixStats::rowSums2(is.na(M)) < dim(M)[2]*1e-2, ]  ## allow 1% missings per row

Benchmark

Unit: seconds
     expr      min       lq     mean   median       uq      max neval
  rowSums 3.083927 3.104315 3.135399 3.131235 3.162080 3.209985   100
 rowSums2 2.129368 2.241577 2.347562 2.287717 2.494494 2.683859   100

Code:

m=2e5; n=1e3
set.seed(42)
M <- matrix(rpois(n*m, 2), m, n)
M[sample(seq_along(M), m*n*.01)] <- NA_integer_

microbenchmark::microbenchmark(
  rowSums=M[rowSums(is.na(M)) < dim(M)[2]*1e-2, ],
  rowSums2=M[matrixStats::rowSums2(is.na(M)) < dim(M)[2]*1e-2, ],
  check='identical'
  ) |> print()

Rscript --vanilla ~/is_na_issue.R

Upvotes: 0

GitZine

Reputation: 475

In case you want to remove only the rows having NAs in all columns, here is the solution:

df %>%
    filter(!if_all(everything(), ~  is.na(.)))

Upvotes: 4

kasia_b

Reputation: 276

dplyr 1.0.4 introduced two companion functions to filter: they are if_any() and if_all(). The if_all() companion function will be particularly useful in this case:

a) To remove rows that contain NAs across all columns

df %>% 
  filter(if_all(everything(), ~ !is.na(.x)))

This line will keep only those rows where none of the columns have NAs.

b) To remove rows that contain NAs in only some columns

cols_to_check = c("rnor", "cfam")

df %>% 
  filter(if_all(cols_to_check, ~ !is.na(.x)))

This line will check if any of the specified columns (cols_to_check) have NAs, and only keep those rows where this is not the case.

Upvotes: 15

bschneidr

Reputation: 6277

One approach that's both general and yields fairly-readable code is to use the filter() function and the across() helper functions from the {dplyr} package.

library(dplyr)

vars_to_check <- c("rnor", "cfam")

# Filter a specific list of columns to keep only non-missing entries

df %>% 
  filter(across(one_of(vars_to_check),
                ~ !is.na(.x)))

# Filter all the columns to exclude NA
df %>% 
  filter(across(everything(),
                ~ !is.na(.)))

# Filter only numeric columns
df %>%
  filter(across(where(is.numeric),
                ~ !is.na(.)))

Similarly, there are also the variant functions in the dplyr package (filter_all, filter_at, filter_if) which accomplish the same thing:

library(dplyr)

vars_to_check <- c("rnor", "cfam")

# Filter a specific list of columns to keep only non-missing entries
df %>% 
  filter_at(.vars = vars(one_of(vars_to_check)),
            ~ !is.na(.))

# Filter all the columns to exclude NA
df %>% 
  filter_all(~ !is.na(.))

# Filter only numeric columns
df %>%
  filter_if(is.numeric,
            ~ !is.na(.))

Upvotes: 20

Joni Hoppen

Reputation: 688

My guess is that this could be more elegantly solved in this way:

  m <- matrix(1:25, ncol = 5)
  m[c(1, 6, 13, 25)] <- NA
  df <- data.frame(m)
  library(dplyr) 
  df %>%
  filter_all(any_vars(is.na(.)))
  #>   X1 X2 X3 X4 X5
  #> 1 NA NA 11 16 21
  #> 2  3  8 NA 18 23
  #> 3  5 10 15 20 NA

Upvotes: 4

lukeA

Reputation: 54247

tidyr has a new function drop_na:

library(tidyr)
df %>% drop_na()
#              gene hsap mmul mmus rnor cfam
# 2 ENSG00000199674    0    2    2    2    2
# 6 ENSG00000221312    0    1    2    3    2
df %>% drop_na(rnor, cfam)
#              gene hsap mmul mmus rnor cfam
# 2 ENSG00000199674    0    2    2    2    2
# 4 ENSG00000207604    0   NA   NA    1    2
# 6 ENSG00000221312    0    1    2    3    2

Upvotes: 235

LegitMe

Reputation: 590

For your first question, I have a code that I am comfortable with to get rid of all NAs. Thanks for @Gregor to make it simpler.

final[!(rowSums(is.na(final))),]

For the second question, the code is just an alternation from the previous solution.

final[as.logical((rowSums(is.na(final))-5)),]

Notice the -5 is the number of columns in your data. This will eliminate rows with all NAs, since the rowSums adds up to 5 and they become zeroes after subtraction. This time, as.logical is necessary.

Upvotes: 19

sapy

Reputation: 9592

delete.dirt <- function(DF, dart=c('NA')) {
  dirty_rows <- apply(DF, 1, function(r) !any(r %in% dart))
  DF <- DF[dirty_rows, ]
}

mydata <- delete.dirt(mydata)

Above function deletes all the rows from the data frame that has 'NA' in any column and returns the resultant data. If you want to check for multiple values like NA and ? change dart=c('NA') in function param to dart=c('NA', '?')

Upvotes: 4

C8H10N4O2

Reputation: 19015

If performance is a priority, use `data.table` and `na.omit()` with optional param `cols=`.

na.omit.data.table is the fastest on my benchmark (see below), whether for all columns or for select columns (OP question part 2).

If you don't want to use `data.table`, use `complete.cases()`.

On a vanilla data.frame, complete.cases is faster than na.omit() or dplyr::drop_na(). Notice that na.omit.data.frame does not support cols=.

Benchmark result

Here is a comparison of base (blue), dplyr (pink), and data.table (yellow) methods for dropping either all or select missing observations, on notional dataset of 1 million observations of 20 numeric variables with independent 5% likelihood of being missing, and a subset of 4 variables for part 2.

Your results may vary based on length, width, and sparsity of your particular dataset.

Note log scale on y axis.

Benchmark script

#-------  Adjust these assumptions for your own use case  ------------
row_size   <- 1e6L 
col_size   <- 20    # not including ID column
p_missing  <- 0.05   # likelihood of missing observation (except ID col)
col_subset <- 18:21  # second part of question: filter on select columns

#-------  System info for benchmark  ----------------------------------
R.version # R version 3.4.3 (2017-11-30), platform = x86_64-w64-mingw32
library(data.table); packageVersion('data.table') # 1.10.4.3
library(dplyr);      packageVersion('dplyr')      # 0.7.4
library(tidyr);      packageVersion('tidyr')      # 0.8.0
library(microbenchmark)

#-------  Example dataset using above assumptions  --------------------
fakeData <- function(m, n, p){
  set.seed(123)
  m <-  matrix(runif(m*n), nrow=m, ncol=n)
  m[m<p] <- NA
  return(m)
}
df <- cbind( data.frame(id = paste0('ID',seq(row_size)), 
                        stringsAsFactors = FALSE),
             data.frame(fakeData(row_size, col_size, p_missing) )
             )
dt <- data.table(df)

par(las=3, mfcol=c(1,2), mar=c(22,4,1,1)+0.1)
boxplot(
  microbenchmark(
    df[complete.cases(df), ],
    na.omit(df),
    df %>% drop_na,
    dt[complete.cases(dt), ],
    na.omit(dt)
  ), xlab='', 
  main = 'Performance: Drop any NA observation',
  col=c(rep('lightblue',2),'salmon',rep('beige',2))
)
boxplot(
  microbenchmark(
    df[complete.cases(df[,col_subset]), ],
    #na.omit(df), # col subset not supported in na.omit.data.frame
    df %>% drop_na(col_subset),
    dt[complete.cases(dt[,col_subset,with=FALSE]), ],
    na.omit(dt, cols=col_subset) # see ?na.omit.data.table
  ), xlab='', 
  main = 'Performance: Drop NA obs. in select cols',
  col=c('lightblue','salmon',rep('beige',2))
)

Upvotes: 60

Joris Meys

Reputation: 108583

Also check complete.cases :

> final[complete.cases(final), ]
             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0    2    2    2    2
6 ENSG00000221312    0    1    2    3    2

na.omit is nicer for just removing all NA's. complete.cases allows partial selection by including only certain columns of the dataframe:

> final[complete.cases(final[ , 5:6]),]
             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0    2    2    2    2
4 ENSG00000207604    0   NA   NA    1    2
6 ENSG00000221312    0    1    2    3    2

Your solution can't work. If you insist on using is.na, then you have to do something like:

> final[rowSums(is.na(final[ , 5:6])) == 0, ]
             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0    2    2    2    2
4 ENSG00000207604    0   NA   NA    1    2
6 ENSG00000221312    0    1    2    3    2

but using complete.cases is quite a lot more clear, and faster.

Upvotes: 1296

Raminsu

Reputation: 299

Using dplyr package we can filter NA as follows:

dplyr::filter(df,  !is.na(columnname))

Upvotes: 23

Prradep

Reputation: 5716

Assuming dat as your dataframe, the expected output can be achieved using

1.rowSums

> dat[!rowSums((is.na(dat))),]
             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0   2    2    2    2
6 ENSG00000221312    0   1    2    3    2

2.lapply

> dat[!Reduce('|',lapply(dat,is.na)),]
             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0   2    2    2    2
6 ENSG00000221312    0   1    2    3    2

Upvotes: 10

Jerry T

Reputation: 1690

I am a synthesizer:). Here I combined the answers into one function:

#' keep rows that have a certain number (range) of NAs anywhere/somewhere and delete others
#' @param df a data frame
#' @param col restrict to the columns where you would like to search for NA; eg, 3, c(3), 2:5, "place", c("place","age")
#' \cr default is NULL, search for all columns
#' @param n integer or vector, 0, c(3,5), number/range of NAs allowed.
#' \cr If a number, the exact number of NAs kept
#' \cr Range includes both ends 3<=n<=5
#' \cr Range could be -Inf, Inf
#' @return returns a new df with rows that have NA(s) removed
#' @export
ez.na.keep = function(df, col=NULL, n=0){
    if (!is.null(col)) {
        # R converts a single row/col to a vector if the parameter col has only one col
        # see https://radfordneal.wordpress.com/2008/08/20/design-flaws-in-r-2-%E2%80%94-dropped-dimensions/#comments
        df.temp = df[,col,drop=FALSE]
    } else {
        df.temp = df
    }

    if (length(n)==1){
        if (n==0) {
            # simply call complete.cases which might be faster
            result = df[complete.cases(df.temp),]
        } else {
            # credit: http://stackoverflow.com/a/30461945/2292993
            log <- apply(df.temp, 2, is.na)
            logindex <- apply(log, 1, function(x) sum(x) == n)
            result = df[logindex, ]
        }
    }

    if (length(n)==2){
        min = n[1]; max = n[2]
        log <- apply(df.temp, 2, is.na)
        logindex <- apply(log, 1, function(x) {sum(x) >= min && sum(x) <= max})
        result = df[logindex, ]
    }

    return(result)
}

Upvotes: 9

Pierre L

Reputation: 28461

If you want control over how many NAs are valid for each row, try this function. For many survey data sets, too many blank question responses can ruin the results. So they are deleted after a certain threshold. This function will allow you to choose how many NAs the row can have before it's deleted:

delete.na <- function(DF, n=0) {
  DF[rowSums(is.na(DF)) <= n,]
}

By default, it will eliminate all NAs:

delete.na(final)
             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0    2    2    2    2
6 ENSG00000221312    0    1    2    3    2

Or specify the maximum number of NAs allowed:

delete.na(final, 2)
             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0    2    2    2    2
4 ENSG00000207604    0   NA   NA    1    2
6 ENSG00000221312    0    1    2    3    2

Upvotes: 68

Ramya Ural

Reputation: 140

We can also use the subset function for this.

finalData<-subset(data,!(is.na(data["mmul"]) | is.na(data["rnor"])))

This will give only those rows that do not have NA in both mmul and rnor

Upvotes: 14

Leo

Reputation: 1845

This will return the rows that have at least ONE non-NA value.

final[rowSums(is.na(final))<length(final),]

This will return the rows that have at least TWO non-NA value.

final[rowSums(is.na(final))<(length(final)-1),]

Upvotes: 20

getting-there

Reputation: 1409

Another option if you want greater control over how rows are deemed to be invalid is

final <- final[!(is.na(final$rnor)) | !(is.na(rawdata$cfam)),]

Using the above, this:

             gene hsap mmul mmus rnor cfam
1 ENSG00000208234    0   NA   NA   NA   2
2 ENSG00000199674    0   2    2    2    2
3 ENSG00000221622    0   NA   NA   2   NA
4 ENSG00000207604    0   NA   NA   1    2
5 ENSG00000207431    0   NA   NA   NA   NA
6 ENSG00000221312    0   1    2    3    2

Becomes:

             gene hsap mmul mmus rnor cfam
1 ENSG00000208234    0   NA   NA   NA   2
2 ENSG00000199674    0   2    2    2    2
3 ENSG00000221622    0   NA   NA   2   NA
4 ENSG00000207604    0   NA   NA   1    2
6 ENSG00000221312    0   1    2    3    2

...where only row 5 is removed since it is the only row containing NAs for both rnor AND cfam. The boolean logic can then be changed to fit specific requirements.

Upvotes: 53

donshikin

Reputation: 1503

I prefer following way to check whether rows contain any NAs:

row.has.na <- apply(final, 1, function(x){any(is.na(x))})

This returns logical vector with values denoting whether there is any NA in a row. You can use it to see how many rows you'll have to drop:

sum(row.has.na)

and eventually drop them

final.filtered <- final[!row.has.na,]

For filtering rows with certain part of NAs it becomes a little trickier (for example, you can feed 'final[,5:6]' to 'apply'). Generally, Joris Meys' solution seems to be more elegant.

Upvotes: 105

Roman Luštrik

Reputation: 70653

Try na.omit(your.data.frame). As for the second question, try posting it as another question (for clarity).

Upvotes: 324

Remove rows with all or some NAs (missing values) in data.frame

Answers (20)

Benchmark

If performance is a priority, use data.table and na.omit() with optional param cols=.

If you don't want to use data.table, use complete.cases().

Benchmark result

Benchmark script

Related Questions

If performance is a priority, use `data.table` and `na.omit()` with optional param `cols=`.

If you don't want to use `data.table`, use `complete.cases()`.