Reputation: 13
I have a table of data with multiple columns ID; A; B; C; D (A-D are samples) I want to be able to scan the table for any ID present in at least one sample at >0.01% of the total sample. current table looks like
ID; A; B; C; D;
bacteria1; 0.00; 2.54; 0.10; 0.54;
bacteria2; 0.50; 0.04; 0.20; 0.03;
bacteria3; 0.08; 0.05; 0.08; 0.04;
bacteria4; 0.009; 0.005; 0.008; 0.004;
SO in the above example I would like to filter out the bacteria4 into a new file and end up with 2 files, one with a list of bacteria that are present in at least one sample >0.01% threshold and the other that has only bacteria that are under that threshold in all samples.
File 1 over0.01prec
ID; A; B; C; D;
bacteria1; 0.00; 2.54; 0.10; 0.54;
bacteria2; 0.50; 0.04; 0.20; 0.03;
bacteria3; 0.08; 0.05; 0.08; 0.04;
File 2 under0.01prec
ID; A; B; C; D;
bacteria4; 0.009; 0.005; 0.008; 0.004;
Does that make sense? Thanks so much for your help in this matter, I am just attempting to set a threshold on the data, and I have tried several ways and finally decided this is the best format to work with, but I could have it in a gathered format as well, if that is easier for the coding...
thanks Anna
Upvotes: 1
Views: 1341
Reputation: 33782
I assume, based on your example output, that the values in your example data are already percentages even though they don't sum to 100% per sample.
I would caution against creating multiple files based on criteria. It's much easier to keep everything in one data frame and flag rows for filtering.
Given your data I would gather it to a long form and flag those samples where the value is above the threshold.
You can then determine which ID are below the threshold in every sample by grouping on ID and finding the maximum value of the flag.
library(dplyr)
library(tidyr)
mydata <- read.table(text = "ID A B C D
bacteria1 0.00 2.54 0.10 0.54
bacteria2 0.50 0.04 0.20 0.03
bacteria3 0.08 0.05 0.08 0.04
bacteria4 0.009 0.005 0.008 0.004", header = TRUE)
Flag values:
mydata %>%
gather("Sample", "Value", -ID) %>%
arrange(ID) %>%
mutate(Flag = ifelse(Value > 0.01, 1, 0))
ID Sample Value Flag
1 bacteria1 A 0.000 0
2 bacteria1 B 2.540 1
3 bacteria1 C 0.100 1
4 bacteria1 D 0.540 1
5 bacteria2 A 0.500 1
6 bacteria2 B 0.040 1
7 bacteria2 C 0.200 1
8 bacteria2 D 0.030 1
9 bacteria3 A 0.080 1
10 bacteria3 B 0.050 1
11 bacteria3 C 0.080 1
12 bacteria3 D 0.040 1
13 bacteria4 A 0.009 0
14 bacteria4 B 0.005 0
15 bacteria4 C 0.008 0
16 bacteria4 D 0.004 0
As above with group and summary:
mydata %>%
gather("Sample", "Value", -ID) %>%
arrange(ID) %>%
mutate(Flag = ifelse(Value > 0.01, 1, 0)) %>%
group_by(ID) %>%
summarise(anyFlagged = max(Flag))
# A tibble: 4 x 2
ID anyFlagged
<chr> <dbl>
1 bacteria1 1
2 bacteria2 1
3 bacteria3 1
4 bacteria4 0
Upvotes: 0
Reputation: 389145
You can remove the first column and calculate 0.01% of all the values (val
). You can then compare val
with the dataframe and divide data into two parts. One which has at least one row greater than val
and another which has no row greater than val
.
val <- sum(unlist(df[-1])) * 0.01
sum_val <- rowSums(df[-1] > val)
df1 <- df[sum_val > 0, ]
df2 <- df[sum_val == 0, ]
df1
# ID A B C D
#1 bacteria1 0.00 2.54 0.10 0.54
#2 bacteria2 0.50 0.04 0.20 0.03
#3 bacteria3 0.08 0.05 0.08 0.04
df2
# ID A B C D
#4 bacteria4 0.009 0.005 0.008 0.004
You can use write.csv
to write the data as csv.
write.csv(df1, 'over.csv', row.names = FALSE)
write.csv(df2, 'under.csv', row.names = FALSE)
data
df <- structure(list(ID = c("bacteria1", "bacteria2", "bacteria3",
"bacteria4"), A = c(0, 0.5, 0.08, 0.009), B = c(2.54, 0.04, 0.05,
0.005), C = c(0.1, 0.2, 0.08, 0.008), D = c(0.54, 0.03, 0.04,
0.004)), row.names = c(NA, -4L), class = "data.frame")
Upvotes: 2