Reputation: 49
I need to filter the data frame below according to the number of samples each otu
occurs in.
   samples otu1 otu2 otu3 otu4 otu5
1        a    2    1    0    0    3
2        b    2    4    1    4    3
3        c    0    0    0    1    0
4        d    0    0    1    4    4
5        e    1    2    0    2    3
6        f    1    1    2    4    2
7        g    1    0    0    4    3
8        h    0    0    2    0    4
9        i    1    2    2    1    6
10       j    0    0    2    3    4
For example, to keep only the otus that occur in >=80% of the samples, the output would look like:
   samples otu4 otu5
1        a    0    3
2        b    4    3
3        c    1    0
4        d    4    4
5        e    2    3
6        f    4    2
7        g    4    3
8        h    0    4
9        i    1    6
10       j    3    4
Upvotes: 0
Views: 40
Reputation: 886998
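We can use select with where(), keeping samples plus every numeric column whose proportion of non-zero values is at least 0.8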
library(dplyr)
df1 %>%
  select(samples, where(~ is.numeric(.) && mean(. != 0) >= 0.8))
-output
#   samples otu4 otu5
#1        a    0    3
#2        b    4    3
#3        c    1    0
#4        d    4    4
#5        e    2    3
#6        f    4    2
#7        g    4    3
#8        h    0    4
#9        i    1    6
#10       j    3    4
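To make the condition easier to inspect, here is a minimal base R sketch of the same idea. It assumes, as in the example data, that the first column holds the sample labels and every other column is a count. The names prevalence, min_prev, kept and res are only illustrative and are not part of the original answer.

```r
# Same values as the example data, rebuilt here so the sketch runs on its own
df1 <- data.frame(
  samples = letters[1:10],
  otu1 = c(2L, 2L, 0L, 0L, 1L, 1L, 1L, 0L, 1L, 0L),
  otu2 = c(1L, 4L, 0L, 0L, 2L, 1L, 0L, 0L, 2L, 0L),
  otu3 = c(0L, 1L, 0L, 1L, 0L, 2L, 0L, 2L, 2L, 2L),
  otu4 = c(0L, 4L, 1L, 4L, 2L, 4L, 4L, 0L, 1L, 3L),
  otu5 = c(3L, 3L, 0L, 4L, 3L, 2L, 3L, 4L, 6L, 4L)
)

# Fraction of samples in which each otu has a non-zero count
# (df1[-1] drops the samples column; != 0 gives a logical matrix)
prevalence <- colMeans(df1[-1] != 0)
prevalence
# otu1 otu2 otu3 otu4 otu5
#  0.6  0.5  0.6  0.8  0.9

# Keep the samples column plus the otus at or above the cutoff
min_prev <- 0.8  # illustrative name for the 80% threshold
kept <- names(prevalence)[prevalence >= min_prev]
res <- df1[c("samples", kept)]
```

Printing prevalence first shows why only otu4 and otu5 survive the 80% cutoff; changing min_prev adjusts the filter without touching the rest of the code.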
Or if we are using an older dplyr version, use select_if
df1 %>%
  select_if(~ is.character(.) || (is.numeric(.) && mean(. != 0) >= 0.8))
-data
df1 <- structure(list(samples = c("a", "b", "c", "d", "e", "f", "g",
"h", "i", "j"), otu1 = c(2L, 2L, 0L, 0L, 1L, 1L, 1L, 0L, 1L,
0L), otu2 = c(1L, 4L, 0L, 0L, 2L, 1L, 0L, 0L, 2L, 0L), otu3 = c(0L,
1L, 0L, 1L, 0L, 2L, 0L, 2L, 2L, 2L), otu4 = c(0L, 4L, 1L, 4L,
2L, 4L, 4L, 0L, 1L, 3L), otu5 = c(3L, 3L, 0L, 4L, 3L, 2L, 3L,
4L, 6L, 4L)), class = "data.frame", row.names = c("1", "2", "3",
"4", "5", "6", "7", "8", "9", "10"))
Upvotes: 2