Reputation: 1
I would like to delete all rows which contain more than 25% NAs in multiple columns. Due to the fact that this involves around 120 corresponding columns it would come in handy not to specify all of them. Preferably rather something like: everything between columnA and columnZ.
I have tried working with:
data[!is.na(data$ColumnA:data$ColumnZ), > 0.25]
But it only showed this error:
Must subset columns with a valid subscript vector.
Logical subscripts must match the size of the indexed input.
x Input has size 250 but subscript !is.na(data$ColumnA:data$ColumnZ) > 0.25 has size 3.
(250 is the actual amount of columns in the dataset)
I have also thought about trying drop_na but this results in a similar issue.
Do you have any suggestions? Thank you in advance
Upvotes: 0
Views: 321
Reputation: 9858
You can use rowMeans(is.na(.)
Data:
set.seed(1)
df<-tibble(matrix(sample(c(1, NA), 64, replace=TRUE, prob = c(0.75, 0.25)), nrow=8))
# A tibble: 8 x 1
`matrix(...)`[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 1 1 1 NA
2 1 1 1 1 NA 1 1 1
3 1 NA 1 1 1 1 1 NA
4 1 1 1 1 NA NA NA 1
5 1 1 1 1 NA NA 1 NA
6 1 1 NA 1 1 1 1 1
7 NA 1 1 1 1 1 NA 1
8 1 1 1 1 1 NA 1 1
Filter:
df%>%filter(rowMeans(is.na(.))<0.25)
# A tibble: 4 x 1
`matrix(...)`[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 1 1 1 NA
2 1 1 1 1 NA 1 1 1
3 1 1 NA 1 1 1 1 1
4 1 1 1 1 1 NA 1 1
Or with rowwise():
df%>%
rowwise()%>%
filter(mean(is.na(c_across(everything())))<0.25)
Upvotes: 1
Reputation: 26630
I typically handle this type of task using the tidyverse and naniar packages.
Here is an example using the "airquality" built-in dataset to identify the rows with >25% NAs (missing values):
library(tidyverse)
library(naniar)
data(airquality)
dat1 <- airquality
miss_case_summary(dat1) %>%
filter(pct_miss >= 25)
# A tibble: 2 x 3
# case n_miss pct_miss
# <int> <int> <dbl>
#1 5 2 33.3
#2 27 2 33.3
And to exclude those two cases (rows):
dat2 <- dat1 %>%
slice(-c(5, 27))
If you have a large number of NAs in your dataset, you could use:
list_of_gt25_NAs <- miss_case_summary(dat1) %>%
filter(pct_miss >= 25)
dat2 <- dat1 %>%
slice(-c(list_of_gt25_NAs$case))
Upvotes: 1