Devon O'Rourke
Devon O'Rourke

Reputation: 267

filtering single matrix conditional upon column name regex AND element value

This is a question rooted in trying to filter one matrix. The input file type is a .txt file containing sample names as columns, unique (sequence) identifiers as rows/index, and elements of the matrix are counts of sequences observed for a given sample. If your familiar with amplicon sequencing (16-S microbiomes, ITS fungal/plant, etc.), yes, this is an OTU table, but if you haven't, it doesn't matter, I swear! Please note I can control the names of the columns of the input file so feel free to suggest a better naming convention if you see fit. I point this out only because it's my understanding that packages like Pandas have the capacity to retain multi-level naming conventions; please remember the input file is just a text file to start, so I can only 'rename' my input file with delimiters should they prove useful in downstream filtering.

I've created an example dataset in R and Python, so feel free to use the either to recreate. Apologies for not naming the R and Python file the same, but I wanted to make it easier for answers to differentiate whether you took the Python or R approach.

## Python enthusiasts:
import pandas as pd
df = pd.DataFrame(
[[1,2,0,0,0,0], [3,0,4,0,0,0], [0,0,5,6,7,0], [0,0,8,0,0,0], [0,0,0,0,0,99]], 
columns=['s1-G1', 's2-G1', 's3-G2', 's4-G2', 's5-G3', 's6-G3'],
index=['OTU1', 'OTU2', 'OTU3', 'OTU4', 'OTU5'])

## R enthusiasts:

ra <- c(1,2,0,0,0,0)
rb <- c(3,0,4,0,0,0)
rc <- c(0,0,5,6,7,0)
rd <- c(0,0,8,0,0,0)
re <- c(0,0,0,0,0,99)
mat <- rbind(ra,rb,rc,rd,re)

colnames(mat) <- c("s1-G1", "s2-G1", "s3-G2", "s4-G2", "s5-G3", "s6-G3")

rownames(mat) <- c("OTU1", "OTU2", "OTU3", "OTU4", "OTU5")

Hopefully you get a data set that looks like this:

     s1-G1 s2-G1 s3-G2 s4-G2 s5-G3 s6-G3
OTU1     1     2     0     0     0     0
OTU2     3     0     4     0     0     0
OTU3     0     0     5     6     7     0
OTU4     0     0     8     0     0     0
OTU5     0     0     0     0     0     99 

As you might have guessed, the column names actually specify two levels or organization, despite being concatenated together: a sample name, and a group name. Thus s1-G1, s2-G1, represent distinct samples from the same group, while s1-G1 and s3-G2 represent distinct samples from different groups. I point this out because the filtering goals require some sort of expression to group by the G# portion of the name, while the S# portion isn't relevant for this question.

Alright, so what's the goal? The goal is to filter this dataset such that we split this matrix into two new matricies: one called df_uniq and another called df_dupd. The specifics for each new matrix:

I want to filter the rows, such that any row in the df_uniq matrix is included when we observe one or more matrix elements with a value greater than zero for any number columns EXCEPT all of those matches are contained within only one unique G value. In the example matrix above, this would result in a new df_uniq matrix that looked like this:

     s1-G1 s2-G1 s3-G2 s4-G2 s5-G3 s6-G3
OTU1     1     2     0     0     0     0
OTU4     0     0     8     0     0     0
OTU5     0     0     0     0     0     99

because we observe values greater than zero in columns s1-G1 and s3-G2 for row OTU2 (which are in two independent groups G1 and G2), those aren't included in the resulting matrix; likewise row OTU3 is also excluded for the same reason (it contains more than one non-zero value in two groups, G2 and G3).

The other matrix, df_dupd would contain the opposite of what's in df_uniq - any row in which there is at least one non-zero element in the matrix for which there are **at least two unique G samples observed. The values from the initial matrix would produce a new table containing what we didn't include in our df_uniq table:

     s1-G1 s2-G1 s3-G2 s4-G2 s5-G3 s6-G3
OTU2     3     0     4     0     0     0
OTU3     0     0     5     6     7     0

Thank you for your consideration and support. I look forward to your responses.

Devon

Upvotes: 0

Views: 217

Answers (1)

caw5cv
caw5cv

Reputation: 721

The following example using data.table should get you started:

library(data.table)

ra <- c(1,2,0,0,0,0)
rb <- c(3,0,4,0,0,0)
rc <- c(0,0,5,6,7,0)
rd <- c(0,0,8,0,0,0)
re <- c(0,0,0,0,0,99)
mat <- rbind(ra,rb,rc,rd,re)

colnames(mat) <- c("s1-G1", "s2-G1", "s3-G2", "s4-G2", "s5-G3", "s6-G3")
rownames(mat) <- c("OTU1", "OTU2", "OTU3", "OTU4", "OTU5")

# Convert to data.table

dat.wide <- data.table(mat)

# add back in OTU as column
dat.wide[, OTU := c("OTU1", "OTU2", "OTU3", "OTU4", "OTU5")]

# Convert to long format
dat.long <- melt(dat.wide)

# split sample and group name into separate columns, creating new column "value"
dat.long[, c("sample","group") := tstrsplit(variable, split="-")]

# remove original joint sample-group column
dat.long[, variable := NULL]

# get unique OTUs
unique.OTUs <- dat.long[, list(N=sum(value)), by=list(group, OTU)][, list(Ngroups=sum(N>0)), by=OTU][Ngroups==1]$OTU

dat.wide[OTU %in% unique.OTUs]

# output:
# s1-G1 s2-G1 s3-G2 s4-G2 s5-G3 s6-G3  OTU
#     1     2     0     0     0     0 OTU1
#     0     0     8     0     0     0 OTU4
#     0     0     0     0     0    99 OTU5

dat.wide[! (OTU %in% unique.OTUs)]

# output:
# s1-G1 s2-G1 s3-G2 s4-G2 s5-G3 s6-G3  OTU
#     3     0     4     0     0     0 OTU2
#     0     0     5     6     7     0 OTU3

In greater detail: Long format table looks like

 OTU value sample group
OTU1     1     s1    G1
OTU2     3     s1    G1
OTU3     0     s1    G1
OTU4     0     s1    G1
OTU5     0     s1    G1
OTU1     2     s2    G1
OTU2     0     s2    G1
OTU3     0     s2    G1
OTU4     0     s2    G1
OTU5     0     s2    G1
OTU1     0     s3    G2
OTU2     4     s3    G2
OTU3     5     s3    G2
OTU4     8     s3    G2
OTU5     0     s3    G2
OTU1     0     s4    G2
OTU2     0     s4    G2
OTU3     6     s4    G2
OTU4     0     s4    G2
OTU5     0     s4    G2
OTU1     0     s5    G3
OTU2     0     s5    G3
OTU3     7     s5    G3
OTU4     0     s5    G3
OTU5     0     s5    G3
OTU1     0     s6    G3
OTU2     0     s6    G3
OTU3     0     s6    G3
OTU4     0     s6    G3
OTU5    99     s6    G3

Which allows us to first sum the observations for each group/OTU combo:

step1 <- dat.long[, list(N=sum(value)), by=list(group, OTU)]

giving us

group  OTU  N
   G1 OTU1  3
   G1 OTU2  3
   G1 OTU3  0
   G1 OTU4  0
   G1 OTU5  0
   G2 OTU1  0
   G2 OTU2  4
   G2 OTU3 11
   G2 OTU4  8
   G2 OTU5  0
   G3 OTU1  0
   G3 OTU2  0
   G3 OTU3  7
   G3 OTU4  0
   G3 OTU5 99

chain on [, list(Ngroups=sum(N>0)), by=OTU] to count the number of groups with nonzero observations

step2 <- step1[, list(Ngroups=sum(N>0)), by=OTU]

giving us

 OTU Ngroups
OTU1       1
OTU2       2
OTU3       2
OTU4       1
OTU5       1

lastly chain on [Ngroups==1] to select rows where nonzero observations only occur in on group

step3 <- step2[Ngroups==1]
step3

giving us

 OTU Ngroups
OTU1       1
OTU4       1
OTU5       1

Lastly, filter the original wide data.table based on the unique OTUs as we did above.

Upvotes: 1

Related Questions