use R to filter out samples below a threshold

Question

I have a table of data with multiple columns ID; A; B; C; D (A-D are samples) I want to be able to scan the table for any ID present in at least one sample at >0.01% of the total sample. current table looks like

ID;         A;        B;     C;     D;
bacteria1;  0.00;   2.54;   0.10;   0.54;
bacteria2;  0.50;   0.04;   0.20;   0.03;
bacteria3;  0.08;   0.05;   0.08;   0.04;
bacteria4;  0.009;  0.005;  0.008;  0.004;

SO in the above example I would like to filter out the bacteria4 into a new file and end up with 2 files, one with a list of bacteria that are present in at least one sample >0.01% threshold and the other that has only bacteria that are under that threshold in all samples.

File 1 over0.01prec
ID;         A;        B;     C;     D;
bacteria1;  0.00;   2.54;   0.10;   0.54;
bacteria2;  0.50;   0.04;   0.20;   0.03;
bacteria3;  0.08;   0.05;   0.08;   0.04;

File 2 under0.01prec
ID;         A;        B;     C;     D;
bacteria4;  0.009;  0.005;  0.008;  0.004;

Does that make sense? Thanks so much for your help in this matter, I am just attempting to set a threshold on the data, and I have tried several ways and finally decided this is the best format to work with, but I could have it in a gathered format as well, if that is easier for the coding...

thanks Anna

neilfws · Accepted Answer

I assume, based on your example output, that the values in your example data are already percentages even though they don't sum to 100% per sample.

I would caution against creating multiple files based on criteria. It's much easier to keep everything in one data frame and flag rows for filtering.

Given your data I would gather it to a long form and flag those samples where the value is above the threshold.

You can then determine which ID are below the threshold in every sample by grouping on ID and finding the maximum value of the flag.

library(dplyr)
library(tidyr)

mydata <- read.table(text = "ID         A        B     C     D
  bacteria1  0.00   2.54   0.10   0.54
  bacteria2  0.50   0.04   0.20   0.03
  bacteria3  0.08   0.05   0.08   0.04
  bacteria4  0.009  0.005  0.008  0.004", header = TRUE)

Flag values:

mydata %>% 
  gather("Sample", "Value", -ID) %>% 
  arrange(ID) %>% 
  mutate(Flag = ifelse(Value > 0.01, 1, 0))

          ID Sample Value Flag
1  bacteria1      A 0.000    0
2  bacteria1      B 2.540    1
3  bacteria1      C 0.100    1
4  bacteria1      D 0.540    1
5  bacteria2      A 0.500    1
6  bacteria2      B 0.040    1
7  bacteria2      C 0.200    1
8  bacteria2      D 0.030    1
9  bacteria3      A 0.080    1
10 bacteria3      B 0.050    1
11 bacteria3      C 0.080    1
12 bacteria3      D 0.040    1
13 bacteria4      A 0.009    0
14 bacteria4      B 0.005    0
15 bacteria4      C 0.008    0
16 bacteria4      D 0.004    0

As above with group and summary:

mydata %>% 
  gather("Sample", "Value", -ID) %>% 
  arrange(ID) %>% 
  mutate(Flag = ifelse(Value > 0.01, 1, 0)) %>% 
  group_by(ID) %>% 
  summarise(anyFlagged = max(Flag))

# A tibble: 4 x 2
  ID        anyFlagged
            
1 bacteria1          1
2 bacteria2          1
3 bacteria3          1
4 bacteria4          0

use R to filter out samples below a threshold

Answers (2)

Related Questions