Rosaline
Rosaline

Reputation: 1

Deleting rows if there are more than 25% NAs in multiple columns

I would like to delete all rows which contain more than 25% NAs in multiple columns. Due to the fact that this involves around 120 corresponding columns it would come in handy not to specify all of them. Preferably rather something like: everything between columnA and columnZ.

I have tried working with:

data[!is.na(data$ColumnA:data$ColumnZ), > 0.25]

But it only showed this error:

Must subset columns with a valid subscript vector.
Logical subscripts must match the size of the indexed input.
x Input has size 250 but subscript !is.na(data$ColumnA:data$ColumnZ) > 0.25 has size 3.

(250 is the actual amount of columns in the dataset)

I have also thought about trying drop_na but this results in a similar issue.

Do you have any suggestions? Thank you in advance

Upvotes: 0

Views: 321

Answers (2)

GuedesBF
GuedesBF

Reputation: 9858

You can use rowMeans(is.na(.)

Data:

set.seed(1)
df<-tibble(matrix(sample(c(1, NA), 64, replace=TRUE, prob = c(0.75, 0.25)), nrow=8))

# A tibble: 8 x 1
  `matrix(...)`[,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]
              <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1                 1     1     1     1     1     1     1    NA
2                 1     1     1     1    NA     1     1     1
3                 1    NA     1     1     1     1     1    NA
4                 1     1     1     1    NA    NA    NA     1
5                 1     1     1     1    NA    NA     1    NA
6                 1     1    NA     1     1     1     1     1
7                NA     1     1     1     1     1    NA     1
8                 1     1     1     1     1    NA     1     1

Filter:

df%>%filter(rowMeans(is.na(.))<0.25)

# A tibble: 4 x 1
  `matrix(...)`[,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]
              <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1                 1     1     1     1     1     1     1    NA
2                 1     1     1     1    NA     1     1     1
3                 1     1    NA     1     1     1     1     1
4                 1     1     1     1     1    NA     1     1

Or with rowwise():

df%>%
        rowwise()%>%
        filter(mean(is.na(c_across(everything())))<0.25)

Upvotes: 1

jared_mamrot
jared_mamrot

Reputation: 26630

I typically handle this type of task using the tidyverse and naniar packages.

Here is an example using the "airquality" built-in dataset to identify the rows with >25% NAs (missing values):

library(tidyverse)
library(naniar)

data(airquality)
dat1 <- airquality
miss_case_summary(dat1) %>% 
  filter(pct_miss >= 25)

# A tibble: 2 x 3
#   case n_miss pct_miss
#  <int>  <int>    <dbl>
#1     5      2     33.3
#2    27      2     33.3

And to exclude those two cases (rows):

dat2 <- dat1 %>% 
  slice(-c(5, 27))

If you have a large number of NAs in your dataset, you could use:

list_of_gt25_NAs <- miss_case_summary(dat1) %>% 
  filter(pct_miss >= 25)

dat2 <- dat1 %>% 
  slice(-c(list_of_gt25_NAs$case))

Upvotes: 1

Related Questions