Reputation: 99
I have a phylogenetic tree in newick format. I want to pull out a subtree based on the labels of the terminal nodes (so based on a list of species). A copy of the tree I am using can be found here: http://hgdownload.soe.ucsc.edu/goldenPath/dm6/multiz27way/dm6.27way.nh
Currently I have read in the tree using BioPython like so:
from Bio import Phylo
#read in Phylogenetic Tree
tree = Phylo.read('dm6.27way.nh', 'newick')
#list of species of interest
species_list = ['dm6', 'droSim1', 'droSec1', 'droYak3', 'droEre2', 'droBia2', 'droSuz1', 'droAna3', 'droBip2', 'droEug2', 'droEle2', 'droKik2', 'droTak2', 'droRho2', 'droFic2']
How would I pull out the subtree of only the species in species_list?
Upvotes: 2
Views: 2116
Reputation: 668
As @mitoRibo have pointed out in their comment to the OP, the question can be understood differently:
species_list
(which may include other clades)OR
species_list
In the example given in the OP these two are indistinguishable, as species_list
contains all the clades descendant from their common ancestor, so the answer by @mitoRibi does the trick. However, if we were interested in the second interpretation above, it wouldn't suffice. Indeed, if we limit the list to only 2 species:
from Bio import Phylo
#read in Phylogenetic Tree
tree = Phylo.read('dm6.27way.nh', 'newick')
species_list = ['dm6', 'droFic2']
ca = tree.common_ancestor(species_list)
the result will be still the tree with 15 clades (the same as in the original list in the OP.)
I therefore post the solution that extracts the subtree containing only the clades in the species tree:
from Bio import Phylo
from copy import deepcopy
#read in Phylogenetic Tree
tree = Phylo.read('dm6.27way.nh', 'newick')
#list of species of interest
species_list = ['dm6', 'droFic2']
subtree = deepcopy(tree)
for t in tree.get_terminals():
if t.name not in species_list:
subtree.prune(t.name)
with the result:
______________________________________________________________________ dm6
_|
|___________________________________________ droFic2
This is admittedly a pedestrian solution, as looping over all clades might be a bit slow for big trees. However, to my knowledge, it is the only way it can be done with BioPython. This is why the solution with ete3
, suggested by @IanFiddes, is well worth looking at.
Upvotes: 1
Reputation: 3011
Instead of using BioPython, use ete3
.
from ete3 import Tree
t = Tree('dm6.27way.nh')
t.prune(species_list, preserve_branch_length=True)
t.write()
From the documentation,
From version 2.2, this function includes also the preserve_branch_length flag, which allows to remove nodes from a tree while keeping original distances among remaining nodes.
Upvotes: 3
Reputation: 4548
Ok yeah, assuming you want the smallest tree that has all the species in your species list you want the root node of this tree to be the most recent common ancestor (MRCA) of all the species in the list which is thankfully already implemented in Phylo:
from Bio import Phylo
#read in Phylogenetic Tree
tree = Phylo.read('dm6.27way.nh', 'newick')
#list of species of interest
species_list = ['dm6',
'droSim1',
'droSec1',
'droYak3',
'droEre2',
'droBia2',
'droSuz1',
'droAna3',
'droBip2',
'droEug2',
'droEle2',
'droKik2',
'droTak2',
'droRho2',
'droFic2']
common_ancestor = tree.common_ancestor(species_list)
Phylo.draw_ascii(common_ancestor)
output:
Clade
___ dm6
___|
| | , droSim1
| |_|
__________| | droSec1
| |
| | _____ droYak3
,| |_|
|| |____ droEre2
||
|| _______ droBia2
||_____|
| |_____ droSuz1
|
__| _______ droAna3
| |_________________________________|
| | |________ droBip2
| |
| |___________________ droEug2
|
|_____________ droEle2
,|
||______________________________ droKik2
__||
| ||______________ droTak2
___________________| |
| |____________ droRho2
|
|_______________ droFic2
Upvotes: 5