Anna
Anna

Reputation: 13

use R to filter out samples below a threshold

I have a table of data with multiple columns ID; A; B; C; D (A-D are samples) I want to be able to scan the table for any ID present in at least one sample at >0.01% of the total sample. current table looks like

ID;         A;        B;     C;     D;
bacteria1;  0.00;   2.54;   0.10;   0.54;
bacteria2;  0.50;   0.04;   0.20;   0.03;
bacteria3;  0.08;   0.05;   0.08;   0.04;
bacteria4;  0.009;  0.005;  0.008;  0.004;

SO in the above example I would like to filter out the bacteria4 into a new file and end up with 2 files, one with a list of bacteria that are present in at least one sample >0.01% threshold and the other that has only bacteria that are under that threshold in all samples.

File 1 over0.01prec
ID;         A;        B;     C;     D;
bacteria1;  0.00;   2.54;   0.10;   0.54;
bacteria2;  0.50;   0.04;   0.20;   0.03;
bacteria3;  0.08;   0.05;   0.08;   0.04;

File 2 under0.01prec
ID;         A;        B;     C;     D;
bacteria4;  0.009;  0.005;  0.008;  0.004;

Does that make sense? Thanks so much for your help in this matter, I am just attempting to set a threshold on the data, and I have tried several ways and finally decided this is the best format to work with, but I could have it in a gathered format as well, if that is easier for the coding...

thanks Anna

Upvotes: 1

Views: 1341

Answers (2)

neilfws
neilfws

Reputation: 33782

I assume, based on your example output, that the values in your example data are already percentages even though they don't sum to 100% per sample.

I would caution against creating multiple files based on criteria. It's much easier to keep everything in one data frame and flag rows for filtering.

Given your data I would gather it to a long form and flag those samples where the value is above the threshold.

You can then determine which ID are below the threshold in every sample by grouping on ID and finding the maximum value of the flag.

library(dplyr)
library(tidyr)

mydata <- read.table(text = "ID         A        B     C     D
  bacteria1  0.00   2.54   0.10   0.54
  bacteria2  0.50   0.04   0.20   0.03
  bacteria3  0.08   0.05   0.08   0.04
  bacteria4  0.009  0.005  0.008  0.004", header = TRUE)

Flag values:

mydata %>% 
  gather("Sample", "Value", -ID) %>% 
  arrange(ID) %>% 
  mutate(Flag = ifelse(Value > 0.01, 1, 0))

          ID Sample Value Flag
1  bacteria1      A 0.000    0
2  bacteria1      B 2.540    1
3  bacteria1      C 0.100    1
4  bacteria1      D 0.540    1
5  bacteria2      A 0.500    1
6  bacteria2      B 0.040    1
7  bacteria2      C 0.200    1
8  bacteria2      D 0.030    1
9  bacteria3      A 0.080    1
10 bacteria3      B 0.050    1
11 bacteria3      C 0.080    1
12 bacteria3      D 0.040    1
13 bacteria4      A 0.009    0
14 bacteria4      B 0.005    0
15 bacteria4      C 0.008    0
16 bacteria4      D 0.004    0

As above with group and summary:

mydata %>% 
  gather("Sample", "Value", -ID) %>% 
  arrange(ID) %>% 
  mutate(Flag = ifelse(Value > 0.01, 1, 0)) %>% 
  group_by(ID) %>% 
  summarise(anyFlagged = max(Flag))

# A tibble: 4 x 2
  ID        anyFlagged
  <chr>          <dbl>
1 bacteria1          1
2 bacteria2          1
3 bacteria3          1
4 bacteria4          0

Upvotes: 0

Ronak Shah
Ronak Shah

Reputation: 389145

You can remove the first column and calculate 0.01% of all the values (val). You can then compare val with the dataframe and divide data into two parts. One which has at least one row greater than val and another which has no row greater than val.

val <- sum(unlist(df[-1])) * 0.01
sum_val <- rowSums(df[-1] > val)
df1 <- df[sum_val > 0, ]
df2 <- df[sum_val == 0, ]
df1
#         ID    A    B    C    D
#1 bacteria1 0.00 2.54 0.10 0.54
#2 bacteria2 0.50 0.04 0.20 0.03
#3 bacteria3 0.08 0.05 0.08 0.04

df2
#         ID     A     B     C     D
#4 bacteria4 0.009 0.005 0.008 0.004

You can use write.csv to write the data as csv.

write.csv(df1, 'over.csv', row.names = FALSE)
write.csv(df2, 'under.csv', row.names = FALSE)

data

df <- structure(list(ID = c("bacteria1", "bacteria2", "bacteria3", 
"bacteria4"), A = c(0, 0.5, 0.08, 0.009), B = c(2.54, 0.04, 0.05, 
0.005), C = c(0.1, 0.2, 0.08, 0.008), D = c(0.54, 0.03, 0.04, 
0.004)), row.names = c(NA, -4L), class = "data.frame")

Upvotes: 2

Related Questions