Reputation: 145
I'm trying to drop columns in which more than 90% of the values are NA. I followed the approach in the linked post below, but I only get a value back rather than a data frame, and I'm not sure what I'm doing wrong. I expected an actual data frame; wrapping the result in as.data.frame also gives an error.
Linked Post: Delete columns/rows with more than x% missing
Sample data:
gene cell1 cell2 cell3
A 0.4 0.1 NA
B NA NA 0.1
C 0.4 NA 0.5
D NA NA 0.5
E 0.5 NA 0.6
F 0.6 NA NA
Expected output:
gene cell1 cell3
A 0.4 NA
B NA 0.1
C 0.4 0.5
D NA 0.5
E 0.5 0.6
F 0.6 NA
#Select Genes that have NA values for 90% of a given cell line
df_col <- df[,2:ncol(df)]
df_col <-df_col[, which(colMeans(!is.na(df_col)) > 0.9)]
df <- cbind(df[,1], df_col)
Upvotes: 3
Views: 133
Reputation: 9858
I would use dplyr here. If you want to use select() with logical conditions, you are probably looking for the where() selection helper in dplyr. It can be used like this: select(where(condition))
I used an 80% threshold because a 90% threshold would keep all columns (the sparsest column, cell2, is only about 83% NA) and would therefore not illustrate the solution as well.
library(dplyr)
df %>% select(where(~mean(is.na(.))<0.8))
It can also be done with base R and colMeans:
df[, c(TRUE, colMeans(is.na(df[-1]))<0.8)]
or with purrr:
library(purrr)
df %>% keep(~mean(is.na(.))<0.8)
Output:
gene cell1 cell3
1 a 0.4 NA
2 b NA 0.1
3 c 0.4 0.5
4 d NA 0.5
5 e 0.5 0.6
6 f 0.6 NA
Data
df<-data.frame(gene=letters[1:6],
cell1=c(0.4, NA, 0.4, NA, 0.5, 0.6),
cell2=c(0.1, rep(NA, 5)),
cell3=c(NA, 0.1, 0.5, 0.5, 0.6, NA))
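To see why a 90% threshold keeps every column, it can help to look at the NA fraction of each column directly. A minimal sketch using the same data as above (the name na_frac is only illustrative):

```r
df <- data.frame(gene = letters[1:6],
                 cell1 = c(0.4, NA, 0.4, NA, 0.5, 0.6),
                 cell2 = c(0.1, rep(NA, 5)),
                 cell3 = c(NA, 0.1, 0.5, 0.5, 0.6, NA))

# Fraction of NA values per measurement column (the gene column is skipped)
na_frac <- colMeans(is.na(df[-1]))
round(na_frac, 2)
# cell1 cell2 cell3
#  0.33  0.83  0.33
```

No column exceeds 0.9, so only a cutoff below roughly 0.83 removes cell2.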
Upvotes: 5
Reputation: 388817
Well, cell2 has 83% NA values (5/6), so a 90% cutoff would not drop it, but anyway you can do:
ignore <- 1
perc <- 0.8 #80 %
df <- cbind(df[ignore], df[-ignore][colMeans(is.na(df[-ignore])) < perc])
df
# gene cell1 cell3
#1 A 0.4 NA
#2 B NA 0.1
#3 C 0.4 0.5
#4 D NA 0.5
#5 E 0.5 0.6
#6 F 0.6 NA
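The same idea can be wrapped in a small helper. This is only a sketch: drop_sparse_cols and its arguments max_na and ignore are hypothetical names, not part of base R or any package. It indexes with single brackets and no comma (data[ignore], rest[keep]), which always returns a data frame, whereas df[, 1] simplifies a single column to a plain vector.

```r
df <- data.frame(gene = LETTERS[1:6],
                 cell1 = c(0.4, NA, 0.4, NA, 0.5, 0.6),
                 cell2 = c(0.1, rep(NA, 5)),
                 cell3 = c(NA, 0.1, 0.5, 0.5, 0.6, NA))

# Hypothetical helper: drop columns whose NA fraction exceeds max_na,
# leaving the columns listed in `ignore` untouched.
drop_sparse_cols <- function(data, max_na = 0.9, ignore = 1) {
  rest <- data[-ignore]                       # columns to be checked
  keep <- colMeans(is.na(rest)) <= max_na     # TRUE for columns to keep
  cbind(data[ignore], rest[keep])             # both pieces stay data frames
}

result <- drop_sparse_cols(df, max_na = 0.8)
names(result)                 # "gene" "cell1" "cell3"
names(drop_sparse_cols(df))   # default 90% cutoff keeps all four columns
```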
Upvotes: 1