Reputation: 323
I have a tab delimited binary matrix of bacterial strain names and genes, listed as present (1) or absent (0), which is output by ROARY (pangenome pipeline).
This is a mock version of the data:
strain <- rep(letters[1:4], 5)
gene <- c(rep("G1", 4), rep("G2", 4), rep("G3", 4), rep("G4", 4), rep("G5", 4))
pres_abs <- c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1)
test <- tibble(strain, gene, pres_abs)
strain gene pres_abs
<chr> <chr> <dbl>
1 a G1 1
2 b G1 1
3 c G1 1
4 d G1 1
5 a G2 1
6 b G2 1
7 c G2 0
8 d G2 0
9 a G3 0
10 b G3 0
11 c G3 1
12 d G3 1
13 a G4 0
14 b G4 0
15 c G4 0
16 d G4 1
17 a G5 1
18 b G5 0
19 c G5 1
20 d G5 1
Just as an aside, it is structured like this when I read it into R using read_tsv()
:
gene a b c d
<chr> <dbl> <dbl> <dbl> <dbl>
1 G1 1 1 1 1
2 G2 1 1 0 0
3 G3 0 0 1 1
4 G4 0 0 0 1
5 G5 1 0 1 1
There are several thousand genes and about 30 strains in my matrix.
I want to identify all the genes which are absent (0) in a certain subset of the strains and save them as a vector (list?) to use in further analysis (e.g. as a filter term for a similar data frame).
For the example above, I only want the genes which are absent in both strain a and strain b (and are therefore present in c and/or d). So I would expect to get genes G3 and G4.
Having done some searching for a solution, I have lengthened the data using pivot_longer
so that it is structured like test
in my example. I tried to filter like this:
test %>% filter(strain %in% c("a", "b") & pres_abs == 0)
Which gives G3 and G4 as I want, but also G5, which I don't as it's present in gene a.
strain gene pres_abs
<chr> <chr> <dbl>
1 a G3 0
2 b G3 0
3 a G4 0
4 b G4 0
5 b G5 0
Can someone help me with the correct filter terms?
Upvotes: 3
Views: 201
Reputation: 932
Dont bother with making the data long. I changed it back to wide and filtered from there.
test %>%
pivot_wider(gene, names_from = strain, values_from = pres_abs, values_fill = 0) %>%
filter(across(c(a,b), ~ .==0)) %>%
pull(gene)
[1] "G3" "G4"
EDIT: show any_of with across working example
test %>%
pivot_wider(gene, names_from = strain, values_from = pres_abs, values_fill = 0) %>%
filter(across(any_of(c("a","b","z","q","t")), ~ .==0)) %>%
pull(gene)
[1] "G3" "G4"
Upvotes: 2
Reputation: 145755
Here's a way where we sum pres_abs
for the strains you're interested in.
test %>%
group_by(gene) %>%
filter(sum(pres_abs[strain %in% c("a", "b")]) == 0)
# # A tibble: 8 x 3
# # Groups: gene [2]
# strain gene pres_abs
# <chr> <chr> <dbl>
# 1 a G3 0
# 2 b G3 0
# 3 c G3 1
# 4 d G3 1
# 5 a G4 0
# 6 b G4 0
# 7 c G4 0
# 8 d G4 1
The above returns all observations for those strains. Alternately, you could do a two-step filter:
test %>%
group_by(gene) %>%
filter(strain %in% c("a", "b")) %>%
filter(sum(pres_abs) == 0)
# # A tibble: 4 x 3
# # Groups: gene [2]
# strain gene pres_abs
# <chr> <chr> <dbl>
# 1 a G3 0
# 2 b G3 0
# 3 a G4 0
# 4 b G4 0
Upvotes: 3
Reputation: 886938
We can do a group by operation i.e. grouped by 'gene', check if all
the 'a', 'b' are found in the 'strain' where the 'pres_abs' value is 0 and to avoid getting the 1 values in pres_abs, create a second condition i.e. 'pres_abs' as 0
library(dplyr)
test %>%
group_by(gene) %>%
filter(all(c("a", "b") %in% strain[pres_abs == 0]),
pres_abs == 0) %>%
ungroup
-ouptut
# A tibble: 5 x 3
strain gene pres_abs
<chr> <chr> <dbl>
1 a G3 0
2 b G3 0
3 a G4 0
4 b G4 0
5 c G4 0
If we need the 1 values as well,
test %>%
group_by(gene) %>%
filter(all(c("a", "b") %in% strain[pres_abs == 0])) %>%
ungroup
-output
# A tibble: 8 x 3
strain gene pres_abs
<chr> <chr> <dbl>
1 a G3 0
2 b G3 0
3 c G3 1
4 d G3 1
5 a G4 0
6 b G4 0
7 c G4 0
8 d G4 1
Upvotes: 2