Reputation: 267
This is a question rooted in trying to filter one matrix. The input file type is a .txt file containing sample names as columns, unique (sequence) identifiers as rows/index, and elements of the matrix are counts of sequences observed for a given sample. If your familiar with amplicon sequencing (16-S microbiomes, ITS fungal/plant, etc.), yes, this is an OTU table, but if you haven't, it doesn't matter, I swear! Please note I can control the names of the columns of the input file so feel free to suggest a better naming convention if you see fit. I point this out only because it's my understanding that packages like Pandas have the capacity to retain multi-level naming conventions; please remember the input file is just a text file to start, so I can only 'rename' my input file with delimiters should they prove useful in downstream filtering.
I've created an example dataset in R and Python, so feel free to use the either to recreate. Apologies for not naming the R and Python file the same, but I wanted to make it easier for answers to differentiate whether you took the Python or R approach.
## Python enthusiasts:
import pandas as pd
df = pd.DataFrame(
[[1,2,0,0,0,0], [3,0,4,0,0,0], [0,0,5,6,7,0], [0,0,8,0,0,0], [0,0,0,0,0,99]],
columns=['s1-G1', 's2-G1', 's3-G2', 's4-G2', 's5-G3', 's6-G3'],
index=['OTU1', 'OTU2', 'OTU3', 'OTU4', 'OTU5'])
## R enthusiasts:
ra <- c(1,2,0,0,0,0)
rb <- c(3,0,4,0,0,0)
rc <- c(0,0,5,6,7,0)
rd <- c(0,0,8,0,0,0)
re <- c(0,0,0,0,0,99)
mat <- rbind(ra,rb,rc,rd,re)
colnames(mat) <- c("s1-G1", "s2-G1", "s3-G2", "s4-G2", "s5-G3", "s6-G3")
rownames(mat) <- c("OTU1", "OTU2", "OTU3", "OTU4", "OTU5")
Hopefully you get a data set that looks like this:
s1-G1 s2-G1 s3-G2 s4-G2 s5-G3 s6-G3
OTU1 1 2 0 0 0 0
OTU2 3 0 4 0 0 0
OTU3 0 0 5 6 7 0
OTU4 0 0 8 0 0 0
OTU5 0 0 0 0 0 99
As you might have guessed, the column names actually specify two levels or organization, despite being concatenated together: a sample name, and a group name. Thus s1-G1
, s2-G1
, represent distinct samples from the same group, while s1-G1
and s3-G2
represent distinct samples from different groups. I point this out because the filtering goals require some sort of expression to group by the G#
portion of the name, while the S#
portion isn't relevant for this question.
Alright, so what's the goal? The goal is to filter this dataset such that we split this matrix into two new matricies: one called df_uniq
and another called df_dupd
. The specifics for each new matrix:
I want to filter the rows, such that any row in the df_uniq
matrix is included when we observe one or more matrix elements with a value greater than zero for any number columns EXCEPT all of those matches are contained within only one unique G
value. In the example matrix above, this would result in a new df_uniq
matrix that looked like this:
s1-G1 s2-G1 s3-G2 s4-G2 s5-G3 s6-G3
OTU1 1 2 0 0 0 0
OTU4 0 0 8 0 0 0
OTU5 0 0 0 0 0 99
because we observe values greater than zero in columns s1-G1
and s3-G2
for row OTU2 (which are in two independent groups G1
and G2
), those aren't included in the resulting matrix; likewise row OTU3 is also excluded for the same reason (it contains more than one non-zero value in two groups, G2
and G3
).
The other matrix, df_dupd
would contain the opposite of what's in df_uniq
- any row in which there is at least one non-zero element in the matrix for which there are **at least two unique G
samples observed. The values from the initial matrix would produce a new table containing what we didn't include in our df_uniq
table:
s1-G1 s2-G1 s3-G2 s4-G2 s5-G3 s6-G3
OTU2 3 0 4 0 0 0
OTU3 0 0 5 6 7 0
Thank you for your consideration and support. I look forward to your responses.
Devon
Upvotes: 0
Views: 217
Reputation: 721
The following example using data.table should get you started:
library(data.table)
ra <- c(1,2,0,0,0,0)
rb <- c(3,0,4,0,0,0)
rc <- c(0,0,5,6,7,0)
rd <- c(0,0,8,0,0,0)
re <- c(0,0,0,0,0,99)
mat <- rbind(ra,rb,rc,rd,re)
colnames(mat) <- c("s1-G1", "s2-G1", "s3-G2", "s4-G2", "s5-G3", "s6-G3")
rownames(mat) <- c("OTU1", "OTU2", "OTU3", "OTU4", "OTU5")
# Convert to data.table
dat.wide <- data.table(mat)
# add back in OTU as column
dat.wide[, OTU := c("OTU1", "OTU2", "OTU3", "OTU4", "OTU5")]
# Convert to long format
dat.long <- melt(dat.wide)
# split sample and group name into separate columns, creating new column "value"
dat.long[, c("sample","group") := tstrsplit(variable, split="-")]
# remove original joint sample-group column
dat.long[, variable := NULL]
# get unique OTUs
unique.OTUs <- dat.long[, list(N=sum(value)), by=list(group, OTU)][, list(Ngroups=sum(N>0)), by=OTU][Ngroups==1]$OTU
dat.wide[OTU %in% unique.OTUs]
# output:
# s1-G1 s2-G1 s3-G2 s4-G2 s5-G3 s6-G3 OTU
# 1 2 0 0 0 0 OTU1
# 0 0 8 0 0 0 OTU4
# 0 0 0 0 0 99 OTU5
dat.wide[! (OTU %in% unique.OTUs)]
# output:
# s1-G1 s2-G1 s3-G2 s4-G2 s5-G3 s6-G3 OTU
# 3 0 4 0 0 0 OTU2
# 0 0 5 6 7 0 OTU3
In greater detail: Long format table looks like
OTU value sample group
OTU1 1 s1 G1
OTU2 3 s1 G1
OTU3 0 s1 G1
OTU4 0 s1 G1
OTU5 0 s1 G1
OTU1 2 s2 G1
OTU2 0 s2 G1
OTU3 0 s2 G1
OTU4 0 s2 G1
OTU5 0 s2 G1
OTU1 0 s3 G2
OTU2 4 s3 G2
OTU3 5 s3 G2
OTU4 8 s3 G2
OTU5 0 s3 G2
OTU1 0 s4 G2
OTU2 0 s4 G2
OTU3 6 s4 G2
OTU4 0 s4 G2
OTU5 0 s4 G2
OTU1 0 s5 G3
OTU2 0 s5 G3
OTU3 7 s5 G3
OTU4 0 s5 G3
OTU5 0 s5 G3
OTU1 0 s6 G3
OTU2 0 s6 G3
OTU3 0 s6 G3
OTU4 0 s6 G3
OTU5 99 s6 G3
Which allows us to first sum the observations for each group/OTU combo:
step1 <- dat.long[, list(N=sum(value)), by=list(group, OTU)]
giving us
group OTU N
G1 OTU1 3
G1 OTU2 3
G1 OTU3 0
G1 OTU4 0
G1 OTU5 0
G2 OTU1 0
G2 OTU2 4
G2 OTU3 11
G2 OTU4 8
G2 OTU5 0
G3 OTU1 0
G3 OTU2 0
G3 OTU3 7
G3 OTU4 0
G3 OTU5 99
chain on [, list(Ngroups=sum(N>0)), by=OTU] to count the number of groups with nonzero observations
step2 <- step1[, list(Ngroups=sum(N>0)), by=OTU]
giving us
OTU Ngroups
OTU1 1
OTU2 2
OTU3 2
OTU4 1
OTU5 1
lastly chain on [Ngroups==1] to select rows where nonzero observations only occur in on group
step3 <- step2[Ngroups==1]
step3
giving us
OTU Ngroups
OTU1 1
OTU4 1
OTU5 1
Lastly, filter the original wide data.table based on the unique OTUs as we did above.
Upvotes: 1