Reputation: 12054
I'd like to remove the lines in this data frame that:
a) contain NA
s across all columns. Below is my example data frame.
gene hsap mmul mmus rnor cfam
1 ENSG00000208234 0 NA NA NA NA
2 ENSG00000199674 0 2 2 2 2
3 ENSG00000221622 0 NA NA NA NA
4 ENSG00000207604 0 NA NA 1 2
5 ENSG00000207431 0 NA NA NA NA
6 ENSG00000221312 0 1 2 3 2
Basically, I'd like to get a data frame such as the following.
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
6 ENSG00000221312 0 1 2 3 2
b) contain NA
s in only some columns, so I can also get this result:
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
4 ENSG00000207604 0 NA NA 1 2
6 ENSG00000221312 0 1 2 3 2
Upvotes: 1087
Views: 2413001
Reputation: 52319
Another option is the na_omit
function of collapse
:
library(collapse)
na_omit(df)
# gene hsap mmul mmus rnor cfam
# 1 ENSG00000199674 0 2 2 2 2
# 2 ENSG00000221312 0 1 2 3 2
Or, for selected columns:
na_omit(df, cols = c("rnor", "cfam")) #Alternatively, works with a function, index or logical vector
# gene hsap mmul mmus rnor cfam
# 1 ENSG00000199674 0 2 2 2 2
# 2 ENSG00000207604 0 NA NA 1 2
# 3 ENSG00000221312 0 1 2 3 2
collapse::na_omit
is faster than any other solutions provided here, including data.table::na.omit
and tidyr::drop_na
:
#Using data from https://stackoverflow.com/a/48830183/13460602
mb <-
microbenchmark(
collapse = na_omit(dt),
dt = na.omit(dt),
base = na.omit(df),
complete.cases = df[complete.cases(df), ],
tidyr = drop_na(df)
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# collapse 70.6927 130.5431 191.9058 156.6320 217.1957 915.6074 100
# dt 76.5151 130.7049 195.9737 172.6981 215.2754 735.1839 100
# base 406.3456 778.5028 900.1791 860.7407 1005.6444 2009.2036 100
# complete.cases 295.6927 497.8572 649.3397 575.9408 715.9452 2137.3366 100
# tidyr 83.4586 145.3758 207.3149 180.9264 242.0750 838.3654 100
Upvotes: 1
Reputation: 73562
For removing rows with some missings, i.e. where the columns are not known beforehand, we have @PierreL's great answer using rowSums()
.
I would like to quickly point out a slightly faster way using rowSums2
from the matrixStats package. For a 200,000x1,000 matrix, this saves almost a second.
The function is applied in the same way. (Note, that the result of is.na()
is already of class "matrix"
, otherwise rowSums2
would have complained.)
M[matrixStats::rowSums2(is.na(M)) < dim(M)[2]*1e-2, ] ## allow 1% missings per row
Unit: seconds
expr min lq mean median uq max neval
rowSums 3.083927 3.104315 3.135399 3.131235 3.162080 3.209985 100
rowSums2 2.129368 2.241577 2.347562 2.287717 2.494494 2.683859 100
Code:
m=2e5; n=1e3
set.seed(42)
M <- matrix(rpois(n*m, 2), m, n)
M[sample(seq_along(M), m*n*.01)] <- NA_integer_
microbenchmark::microbenchmark(
rowSums=M[rowSums(is.na(M)) < dim(M)[2]*1e-2, ],
rowSums2=M[matrixStats::rowSums2(is.na(M)) < dim(M)[2]*1e-2, ],
check='identical'
) |> print()
Rscript --vanilla ~/is_na_issue.R
Upvotes: 0
Reputation: 475
In case you want to remove only the rows having NAs in all columns, here is the solution:
df %>%
filter(!if_all(everything(), ~ is.na(.)))
Upvotes: 4
Reputation: 276
dplyr 1.0.4 introduced two companion functions to filter
: they are if_any()
and if_all()
. The if_all()
companion function will be particularly useful in this case:
a) To remove rows that contain NAs across all columns
df %>%
filter(if_all(everything(), ~ !is.na(.x)))
This line will keep only those rows where none of the columns have NAs.
b) To remove rows that contain NAs in only some columns
cols_to_check = c("rnor", "cfam")
df %>%
filter(if_all(cols_to_check, ~ !is.na(.x)))
This line will check if any of the specified columns (cols_to_check) have NAs, and only keep those rows where this is not the case.
Upvotes: 15
Reputation: 6277
One approach that's both general and yields fairly-readable code is to use the filter()
function and the across()
helper functions from the {dplyr} package.
library(dplyr)
vars_to_check <- c("rnor", "cfam")
# Filter a specific list of columns to keep only non-missing entries
df %>%
filter(across(one_of(vars_to_check),
~ !is.na(.x)))
# Filter all the columns to exclude NA
df %>%
filter(across(everything(),
~ !is.na(.)))
# Filter only numeric columns
df %>%
filter(across(where(is.numeric),
~ !is.na(.)))
Similarly, there are also the variant functions in the dplyr package (filter_all
, filter_at
, filter_if
) which accomplish the same thing:
library(dplyr)
vars_to_check <- c("rnor", "cfam")
# Filter a specific list of columns to keep only non-missing entries
df %>%
filter_at(.vars = vars(one_of(vars_to_check)),
~ !is.na(.))
# Filter all the columns to exclude NA
df %>%
filter_all(~ !is.na(.))
# Filter only numeric columns
df %>%
filter_if(is.numeric,
~ !is.na(.))
Upvotes: 20
Reputation: 688
My guess is that this could be more elegantly solved in this way:
m <- matrix(1:25, ncol = 5)
m[c(1, 6, 13, 25)] <- NA
df <- data.frame(m)
library(dplyr)
df %>%
filter_all(any_vars(is.na(.)))
#> X1 X2 X3 X4 X5
#> 1 NA NA 11 16 21
#> 2 3 8 NA 18 23
#> 3 5 10 15 20 NA
Upvotes: 4
Reputation: 54247
tidyr
has a new function drop_na
:
library(tidyr)
df %>% drop_na()
# gene hsap mmul mmus rnor cfam
# 2 ENSG00000199674 0 2 2 2 2
# 6 ENSG00000221312 0 1 2 3 2
df %>% drop_na(rnor, cfam)
# gene hsap mmul mmus rnor cfam
# 2 ENSG00000199674 0 2 2 2 2
# 4 ENSG00000207604 0 NA NA 1 2
# 6 ENSG00000221312 0 1 2 3 2
Upvotes: 235
Reputation: 590
For your first question, I have a code that I am comfortable with to get rid of all NAs. Thanks for @Gregor to make it simpler.
final[!(rowSums(is.na(final))),]
For the second question, the code is just an alternation from the previous solution.
final[as.logical((rowSums(is.na(final))-5)),]
Notice the -5 is the number of columns in your data. This will eliminate rows with all NAs, since the rowSums adds up to 5 and they become zeroes after subtraction. This time, as.logical is necessary.
Upvotes: 19
Reputation: 9592
delete.dirt <- function(DF, dart=c('NA')) {
dirty_rows <- apply(DF, 1, function(r) !any(r %in% dart))
DF <- DF[dirty_rows, ]
}
mydata <- delete.dirt(mydata)
Above function deletes all the rows from the data frame that has 'NA' in any column and returns the resultant data. If you want to check for multiple values like NA
and ?
change dart=c('NA')
in function param to dart=c('NA', '?')
Upvotes: 4
Reputation: 19015
data.table
and na.omit()
with optional param cols=
.na.omit.data.table
is the fastest on my benchmark (see below), whether for all columns or for select columns (OP question part 2).
data.table
, use complete.cases()
.On a vanilla data.frame
, complete.cases
is faster than na.omit()
or dplyr::drop_na()
. Notice that na.omit.data.frame
does not support cols=
.
Here is a comparison of base (blue), dplyr
(pink), and data.table
(yellow) methods for dropping either all or select missing observations, on notional dataset of 1 million observations of 20 numeric variables with independent 5% likelihood of being missing, and a subset of 4 variables for part 2.
Your results may vary based on length, width, and sparsity of your particular dataset.
Note log scale on y axis.
#------- Adjust these assumptions for your own use case ------------
row_size <- 1e6L
col_size <- 20 # not including ID column
p_missing <- 0.05 # likelihood of missing observation (except ID col)
col_subset <- 18:21 # second part of question: filter on select columns
#------- System info for benchmark ----------------------------------
R.version # R version 3.4.3 (2017-11-30), platform = x86_64-w64-mingw32
library(data.table); packageVersion('data.table') # 1.10.4.3
library(dplyr); packageVersion('dplyr') # 0.7.4
library(tidyr); packageVersion('tidyr') # 0.8.0
library(microbenchmark)
#------- Example dataset using above assumptions --------------------
fakeData <- function(m, n, p){
set.seed(123)
m <- matrix(runif(m*n), nrow=m, ncol=n)
m[m<p] <- NA
return(m)
}
df <- cbind( data.frame(id = paste0('ID',seq(row_size)),
stringsAsFactors = FALSE),
data.frame(fakeData(row_size, col_size, p_missing) )
)
dt <- data.table(df)
par(las=3, mfcol=c(1,2), mar=c(22,4,1,1)+0.1)
boxplot(
microbenchmark(
df[complete.cases(df), ],
na.omit(df),
df %>% drop_na,
dt[complete.cases(dt), ],
na.omit(dt)
), xlab='',
main = 'Performance: Drop any NA observation',
col=c(rep('lightblue',2),'salmon',rep('beige',2))
)
boxplot(
microbenchmark(
df[complete.cases(df[,col_subset]), ],
#na.omit(df), # col subset not supported in na.omit.data.frame
df %>% drop_na(col_subset),
dt[complete.cases(dt[,col_subset,with=FALSE]), ],
na.omit(dt, cols=col_subset) # see ?na.omit.data.table
), xlab='',
main = 'Performance: Drop NA obs. in select cols',
col=c('lightblue','salmon',rep('beige',2))
)
Upvotes: 60
Reputation: 108583
Also check complete.cases
:
> final[complete.cases(final), ]
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
6 ENSG00000221312 0 1 2 3 2
na.omit
is nicer for just removing all NA
's. complete.cases
allows partial selection by including only certain columns of the dataframe:
> final[complete.cases(final[ , 5:6]),]
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
4 ENSG00000207604 0 NA NA 1 2
6 ENSG00000221312 0 1 2 3 2
Your solution can't work. If you insist on using is.na
, then you have to do something like:
> final[rowSums(is.na(final[ , 5:6])) == 0, ]
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
4 ENSG00000207604 0 NA NA 1 2
6 ENSG00000221312 0 1 2 3 2
but using complete.cases
is quite a lot more clear, and faster.
Upvotes: 1296
Reputation: 299
Using dplyr package we can filter NA as follows:
dplyr::filter(df, !is.na(columnname))
Upvotes: 23
Reputation: 5716
Assuming dat
as your dataframe, the expected output can be achieved using
1.rowSums
> dat[!rowSums((is.na(dat))),]
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
6 ENSG00000221312 0 1 2 3 2
2.lapply
> dat[!Reduce('|',lapply(dat,is.na)),]
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
6 ENSG00000221312 0 1 2 3 2
Upvotes: 10
Reputation: 1690
I am a synthesizer:). Here I combined the answers into one function:
#' keep rows that have a certain number (range) of NAs anywhere/somewhere and delete others
#' @param df a data frame
#' @param col restrict to the columns where you would like to search for NA; eg, 3, c(3), 2:5, "place", c("place","age")
#' \cr default is NULL, search for all columns
#' @param n integer or vector, 0, c(3,5), number/range of NAs allowed.
#' \cr If a number, the exact number of NAs kept
#' \cr Range includes both ends 3<=n<=5
#' \cr Range could be -Inf, Inf
#' @return returns a new df with rows that have NA(s) removed
#' @export
ez.na.keep = function(df, col=NULL, n=0){
if (!is.null(col)) {
# R converts a single row/col to a vector if the parameter col has only one col
# see https://radfordneal.wordpress.com/2008/08/20/design-flaws-in-r-2-%E2%80%94-dropped-dimensions/#comments
df.temp = df[,col,drop=FALSE]
} else {
df.temp = df
}
if (length(n)==1){
if (n==0) {
# simply call complete.cases which might be faster
result = df[complete.cases(df.temp),]
} else {
# credit: http://stackoverflow.com/a/30461945/2292993
log <- apply(df.temp, 2, is.na)
logindex <- apply(log, 1, function(x) sum(x) == n)
result = df[logindex, ]
}
}
if (length(n)==2){
min = n[1]; max = n[2]
log <- apply(df.temp, 2, is.na)
logindex <- apply(log, 1, function(x) {sum(x) >= min && sum(x) <= max})
result = df[logindex, ]
}
return(result)
}
Upvotes: 9
Reputation: 28461
If you want control over how many NAs are valid for each row, try this function. For many survey data sets, too many blank question responses can ruin the results. So they are deleted after a certain threshold. This function will allow you to choose how many NAs the row can have before it's deleted:
delete.na <- function(DF, n=0) {
DF[rowSums(is.na(DF)) <= n,]
}
By default, it will eliminate all NAs:
delete.na(final)
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
6 ENSG00000221312 0 1 2 3 2
Or specify the maximum number of NAs allowed:
delete.na(final, 2)
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
4 ENSG00000207604 0 NA NA 1 2
6 ENSG00000221312 0 1 2 3 2
Upvotes: 68
Reputation: 140
We can also use the subset function for this.
finalData<-subset(data,!(is.na(data["mmul"]) | is.na(data["rnor"])))
This will give only those rows that do not have NA in both mmul and rnor
Upvotes: 14
Reputation: 1845
This will return the rows that have at least ONE non-NA value.
final[rowSums(is.na(final))<length(final),]
This will return the rows that have at least TWO non-NA value.
final[rowSums(is.na(final))<(length(final)-1),]
Upvotes: 20
Reputation: 1409
Another option if you want greater control over how rows are deemed to be invalid is
final <- final[!(is.na(final$rnor)) | !(is.na(rawdata$cfam)),]
Using the above, this:
gene hsap mmul mmus rnor cfam
1 ENSG00000208234 0 NA NA NA 2
2 ENSG00000199674 0 2 2 2 2
3 ENSG00000221622 0 NA NA 2 NA
4 ENSG00000207604 0 NA NA 1 2
5 ENSG00000207431 0 NA NA NA NA
6 ENSG00000221312 0 1 2 3 2
Becomes:
gene hsap mmul mmus rnor cfam
1 ENSG00000208234 0 NA NA NA 2
2 ENSG00000199674 0 2 2 2 2
3 ENSG00000221622 0 NA NA 2 NA
4 ENSG00000207604 0 NA NA 1 2
6 ENSG00000221312 0 1 2 3 2
...where only row 5 is removed since it is the only row containing NAs for both rnor
AND cfam
. The boolean logic can then be changed to fit specific requirements.
Upvotes: 53
Reputation: 1503
I prefer following way to check whether rows contain any NAs:
row.has.na <- apply(final, 1, function(x){any(is.na(x))})
This returns logical vector with values denoting whether there is any NA in a row. You can use it to see how many rows you'll have to drop:
sum(row.has.na)
and eventually drop them
final.filtered <- final[!row.has.na,]
For filtering rows with certain part of NAs it becomes a little trickier (for example, you can feed 'final[,5:6]' to 'apply'). Generally, Joris Meys' solution seems to be more elegant.
Upvotes: 105
Reputation: 70653
Try na.omit(your.data.frame)
. As for the second question, try posting it as another question (for clarity).
Upvotes: 324