Mark
Mark

Reputation: 323

I want to use dplyr to filter a data frame if all members of a given group are absent

I have a tab delimited binary matrix of bacterial strain names and genes, listed as present (1) or absent (0), which is output by ROARY (pangenome pipeline).

This is a mock version of the data:

strain <- rep(letters[1:4], 5)
gene <- c(rep("G1", 4), rep("G2", 4), rep("G3", 4), rep("G4", 4), rep("G5", 4))
pres_abs <- c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1) 
test <- tibble(strain, gene, pres_abs)

   strain gene  pres_abs
   <chr>  <chr>    <dbl>
 1 a      G1           1
 2 b      G1           1
 3 c      G1           1
 4 d      G1           1
 5 a      G2           1
 6 b      G2           1
 7 c      G2           0
 8 d      G2           0
 9 a      G3           0
10 b      G3           0
11 c      G3           1
12 d      G3           1
13 a      G4           0
14 b      G4           0
15 c      G4           0
16 d      G4           1
17 a      G5           1
18 b      G5           0
19 c      G5           1
20 d      G5           1

Just as an aside, it is structured like this when I read it into R using read_tsv():

  gene      a     b     c     d
  <chr> <dbl> <dbl> <dbl> <dbl>
1 G1        1     1     1     1
2 G2        1     1     0     0
3 G3        0     0     1     1
4 G4        0     0     0     1
5 G5        1     0     1     1

There are several thousand genes and about 30 strains in my matrix.

I want to identify all the genes which are absent (0) in a certain subset of the strains and save them as a vector (list?) to use in further analysis (e.g. as a filter term for a similar data frame).

For the example above, I only want the genes which are absent in both strain a and strain b (and are therefore present in c and/or d). So I would expect to get genes G3 and G4.

Having done some searching for a solution, I have lengthened the data using pivot_longer so that it is structured like test in my example. I tried to filter like this:

test %>% filter(strain %in% c("a", "b") & pres_abs == 0)

Which gives G3 and G4 as I want, but also G5, which I don't as it's present in gene a.

  strain gene  pres_abs
  <chr>  <chr>    <dbl>
1 a      G3           0
2 b      G3           0
3 a      G4           0
4 b      G4           0
5 b      G5           0

Can someone help me with the correct filter terms?

Upvotes: 3

Views: 201

Answers (3)

zimia
zimia

Reputation: 932

Dont bother with making the data long. I changed it back to wide and filtered from there.

test %>% 
  pivot_wider(gene, names_from = strain, values_from = pres_abs, values_fill = 0) %>% 
  filter(across(c(a,b), ~ .==0)) %>% 
  pull(gene)
[1] "G3" "G4"

EDIT: show any_of with across working example

test %>% 
  pivot_wider(gene, names_from = strain, values_from = pres_abs, values_fill = 0) %>% 
  filter(across(any_of(c("a","b","z","q","t")), ~ .==0)) %>% 
  pull(gene)
[1] "G3" "G4"

Upvotes: 2

Gregor Thomas
Gregor Thomas

Reputation: 145755

Here's a way where we sum pres_abs for the strains you're interested in.

test %>%
  group_by(gene) %>%
  filter(sum(pres_abs[strain %in% c("a", "b")]) == 0)
# # A tibble: 8 x 3
# # Groups:   gene [2]
#   strain gene  pres_abs
#   <chr>  <chr>    <dbl>
# 1 a      G3           0
# 2 b      G3           0
# 3 c      G3           1
# 4 d      G3           1
# 5 a      G4           0
# 6 b      G4           0
# 7 c      G4           0
# 8 d      G4           1

The above returns all observations for those strains. Alternately, you could do a two-step filter:

test %>%
  group_by(gene) %>%
  filter(strain %in% c("a", "b")) %>%
  filter(sum(pres_abs) == 0)
# # A tibble: 4 x 3
# # Groups:   gene [2]
#   strain gene  pres_abs
#   <chr>  <chr>    <dbl>
# 1 a      G3           0
# 2 b      G3           0
# 3 a      G4           0
# 4 b      G4           0

Upvotes: 3

akrun
akrun

Reputation: 886938

We can do a group by operation i.e. grouped by 'gene', check if all the 'a', 'b' are found in the 'strain' where the 'pres_abs' value is 0 and to avoid getting the 1 values in pres_abs, create a second condition i.e. 'pres_abs' as 0

library(dplyr)
test %>%
     group_by(gene) %>% 
     filter(all(c("a", "b") %in% strain[pres_abs == 0]),
               pres_abs == 0) %>%
     ungroup

-ouptut

# A tibble: 5 x 3
  strain gene  pres_abs
  <chr>  <chr>    <dbl>
1 a      G3           0
2 b      G3           0
3 a      G4           0
4 b      G4           0
5 c      G4           0

If we need the 1 values as well,

test %>%
     group_by(gene) %>% 
     filter(all(c("a", "b") %in% strain[pres_abs == 0])) %>%
     ungroup

-output

# A tibble: 8 x 3
  strain gene  pres_abs
  <chr>  <chr>    <dbl>
1 a      G3           0
2 b      G3           0
3 c      G3           1
4 d      G3           1
5 a      G4           0
6 b      G4           0
7 c      G4           0
8 d      G4           1

Upvotes: 2

Related Questions