adub

Reputation: 73

Biopython bootstrapping phylogenetic trees with custom distance matrix

I am trying to create a bootstrapped phylogenetic tree, but instead of using raw multiple sequence alignment data and a standard scoring system, I want to use a custom distance matrix that I have created. So far I have looked at http://biopython.org/wiki/Phylo and have been able to build a single tree from my custom distance matrix with the following code:

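from Bio import Phylo
from Bio.Phylo import TreeConstruction
from Bio.Phylo.TreeConstruction import DistanceTreeConstructor

dm = TreeConstruction._DistanceMatrix(tfs, dmat)
treeConstructor = DistanceTreeConstructor(method='upgma')
upgmaTree = treeConstructor.upgma(dm)
Phylo.draw(upgmaTree)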

where dmat is a lower-triangular distance matrix and tfs is a list of the names used for its rows/columns. In the bootstrapping examples, it seems that the input always has to be raw sequence data rather than a distance matrix like the one I used above. Does anyone know a workaround for this problem? Thanks!

Upvotes: 0

Views: 1624

Answers (1)

nya

Reputation: 2250

Short answer: No, you cannot use a distance matrix to bootstrap a phylogeny.

Long answer: The first step in bootstrapping a phylogeny is to create a set of pseudoreplicate datasets. For DNA sequences, alignment positions (whole columns) are drawn at random, with replacement, until the pseudoreplicate has the same length as the original alignment.

Let's assume a 10 bp alignment of two sequences that differ by two mutations. For simplicity's sake, their distance is d = 0.2.

AATTCCGGGG
AACTCCGGAG

Bootstrapping this dataset means drawing ten positions with replacement. One such draw gives positions 3, 8, 5, 9, 10, 1, 6, 9, 6, 5, which together form the pseudoreplicate shown below.

set.seed(123)
sample(1:10, 10, replace = TRUE)
[1]  3  8  5  9 10  1  6  9  6  5

TGCGGACGCC
CGCAGACACC

We obtain a dataset whose variables (columns) all come from the original alignment but occur at different frequencies. Note that d = 0.3 in the bootstrapped alignment.

With this approach we can bootstrap any variable, or any dataset containing multiple variables. A distance matrix cannot be resampled in this way because it is already processed information: the per-column data that bootstrapping draws from has been collapsed into pairwise distances.
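Since the question is about Biopython, here is the same column resampling as a minimal Python sketch. The helper names `resample_columns` and `p_distance` are made up for illustration, and the positions are copied from the R output above rather than drawn again.

```python
# Two aligned sequences from the example above (2 differences in 10 sites).
seq_a = "AATTCCGGGG"
seq_b = "AACTCCGGAG"

# 1-based column indices drawn with replacement, copied from the R output above.
positions = [3, 8, 5, 9, 10, 1, 6, 9, 6, 5]


def resample_columns(seq, cols):
    """Build a pseudoreplicate by picking whole alignment columns (1-based)."""
    return "".join(seq[c - 1] for c in cols)


def p_distance(a, b):
    """Proportion of sites at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


# The same column indices are applied to every sequence in the alignment.
rep_a = resample_columns(seq_a, positions)
rep_b = resample_columns(seq_b, positions)

print(rep_a, rep_b)  # TGCGGACGCC CGCAGACACC
print(p_distance(seq_a, seq_b), p_distance(rep_a, rep_b))  # 0.2 0.3
```

The key point is that both sequences are resampled with the same column indices, so each column stays intact and only its frequency changes.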

Solution:

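Resample the underlying data from which your custom distance matrix was calculated, then recalculate the matrix for each pseudoreplicate.

# Your function to calculate a custom distance matrix
# (dat holds one variable per column, as in the alignment example above)
calc.dist <- function(dat) { ... }

nrep <- 100  # number of bootstrap pseudoreplicates
# Resample the columns of dat with replacement and recompute the distances
reps <- lapply(1:nrep, FUN=function(i) calc.dist(dat[,sample(1:ncol(dat), ncol(dat), replace = TRUE)]))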
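For the Python side of the question, the same idea could look roughly like the sketch below. It is pure Python and makes several assumptions: the data are stored as one row per taxon, `mismatch_fraction` is only a placeholder for your own distance, and all function names are hypothetical. Each matrix is returned in lower-triangular form with the diagonal included, which is the layout the question's dmat appears to use.

```python
import random


def bootstrap_columns(data, rng):
    """Resample the columns of `data` (rows = taxa) with replacement."""
    ncol = len(data[0])
    cols = [rng.randrange(ncol) for _ in range(ncol)]
    return [[row[c] for c in cols] for row in data]


def lower_triangle(data, dist):
    """Lower-triangular distance matrix, diagonal included."""
    return [[dist(data[i], data[j]) for j in range(i + 1)]
            for i in range(len(data))]


def mismatch_fraction(x, y):
    """Placeholder distance (assumption): fraction of differing positions."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


def bootstrap_matrices(data, dist, nrep=100, seed=0):
    """One custom distance matrix per bootstrap pseudoreplicate."""
    rng = random.Random(seed)
    return [lower_triangle(bootstrap_columns(data, rng), dist)
            for _ in range(nrep)]


# Toy input (made up for illustration): three taxa, ten characters each.
names = ["t1", "t2", "t3"]
data = ["AATTCCGGGG", "AACTCCGGAG", "AACTCCGTAG"]
matrices = bootstrap_matrices(data, mismatch_fraction, nrep=100, seed=1)
```

Each entry of `matrices`, together with `names`, could then be turned into a tree the same way the question builds its single UPGMA tree. As far as I know, the resulting list of trees can be summarised with the functions in `Bio.Phylo.Consensus`, such as `majority_consensus` or `get_support`, but check the documentation for your Biopython version.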

Upvotes: 1
