Reputation: 373
I wish to generate a graph that visualizes the structure of an XML file.
I created a list of nodes to represent the XML file.
Each node contains three strings: the XML tag, its attributes, and its content.
The xml file looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<entry db="genbank">
<data id="AC116785" length="132912" molecule="DNA" data_class="linear" division="HTG" date="08-JUL-2002" />
<definition>
<description>Mus musculus clone RP24-146B1, WORKING DRAFT SEQUENCE, 10 ordered pieces.</description>
</definition>
<accession>AC116785</accession>
<version>
<version_number>AC116785.3</version_number>
<gi>21703640</gi>
</version>
<keywords>
<keyword>HTG</keyword>
<keyword>HTGS_PHASE2</keyword>
<keyword>HTGS_DRAFT</keyword>
<keyword>HTGS_FULLTOP</keyword>
</keywords>
<source>
<abbreviation>house mouse.</abbreviation>
<organism>
<name>Mus musculus</name>
<taxonomy>
<class>Eukaryota</class>
<class>Metazoa</class>
<class>Chordata</class>
<class>Craniata</class>
<class>Vertebrata</class>
<class>Euteleostomi</class>
<class>Mammalia</class>
<class>Eutheria</class>
<class>Rodentia</class>
<class>Sciurognathi</class>
<class>Muridae</class>
<class>Murinae</class>
<class>Mus</class>
</taxonomy>
</organism>
</source>
<references>
<reference number="1" from="1" to="132912">
<authors>
<author>Birren,B.</author>
</authors>
<title>Mus musculus, clone RP24-146B1</title>
<journal>
<location>Unpublished</location>
</journal>
</reference>
<reference number="2" from="1" to="132912">
<authors>
<author>Birren,B.</author>
</authors>
<title>Direct Submission</title>
<journal>
<submission>02-APR-2002</submission>
<department>Whitehead Institute/MIT Center for Genome Research, 320 Charles Street, Cambridge, MA 02141, USA</department>
</journal>
</reference>
<reference number="3" from="1" to="132912">
<authors>
<author>Birren,B.</author>
</authors>
<title>Direct Submission</title>
<journal>
<submission>08-JUL-2002</submission>
<department>Whitehead Institute/MIT Center for Genome Research, 320 Charles Street, Cambridge, MA 02141, USA</department>
</journal>
</reference>
</references>
<comment>
<replaced>
<date>Jul 8, 2002</date>
<gi>21700645</gi>
</replaced>
<information title="All repeats were identified using RepeatMasker">Smit, A.F.A. , Green, P. (1996-1997)http://ftp.genome.washington.edu/RM/RepeatMasker.html</information>
<information title="Center">Whitehead Institute/ MIT Center for Genome Research</information>
<information title="Center code">WIBR</information>
<information title="Web site">http://www-seq.wi.mit.edu</information>
<information title="Contact">[email protected]</information>
<information title="Center project name">L25104</information>
<information title="Center clone name">146_B_1</information>
<information title="Sequencing vector">Plasmid; n/a; 100% of reads</information>
<information title="Chemistry">Dye-terminator Big Dye; 100% of reads</information>
<information title="Assembly program">Phrap; version 0.960731</information>
<information title="Consensus quality">130058 bases at least Q40</information>
<information title="Consensus quality">131186 bases at least Q30</information>
<information title="Consensus quality">131595 bases at least Q20</information>
<information title="Insert size">142000; agarose-fp</information>
<information title="Insert size">132012; sum-of-contigs</information>
<information title="Quality coverage">6.9 in Q20 bases; agarose-fp</information>
<information title="Quality coverage">7.5 in Q20 bases; sum-of-contigs</information>
<information title="NOTE">This is a 'working draft' sequence. It currently consists of 10 contigs. Gaps between the contigsare represented as runs of N. The order of the piecesis believed to be correct as given, however the sizesof the gaps between them are based on estimates that haveprovided by the submittor.This sequence will be replacedby the finished sequence as soon as it is available andthe accession number will be preserved.</information>
<information title="1 1178">contig of 1178 bp in length</information>
<information title="1179 1278">gap of 100 bp</information>
<information title="1279 2835">contig of 1557 bp in length</information>
<information title="2836 2935">gap of 100 bp</information>
<information title="2936 5385">contig of 2450 bp in length</information>
<information title="5386 5485">gap of 100 bp</information>
<information title="5486 8192">contig of 2707 bp in length</information>
<information title="8193 8292">gap of 100 bp</information>
<information title="8293 10488">contig of 2196 bp in length</information>
<information title="10489 10588">gap of 100 bp</information>
<information title="10589 12801">contig of 2213 bp in length</information>
<information title="12802 12901">gap of 100 bp</information>
<information title="12902 18716">contig of 5815 bp in length</information>
<information title="18717 18816">gap of 100 bp</information>
<information title="18817 34793">contig of 15977 bp in length</information>
<information title="34794 34893">gap of 100 bp</information>
<information title="34894 51004">contig of 16111 bp in length</information>
<information title="51005 51104">gap of 100 bp</information>
<information title="51105 132912">contig of 81808 bp in length.</information>
</comment>
<features>
<sequence_feature type="source">
<location>1..132912</location>
<qualifer type="db_xref">taxon:10090</qualifer>
<qualifer type="clone">RP24-146B1</qualifer>
<qualifer type="clone_lib">RPCI-24 Male Mouse BAC</qualifer>
</sequence_feature>
<sequence_feature type="misc_feature">
<location>1..1178</location>
</sequence_feature>
<sequence_feature type="misc_feature">
<location>1279..2835</location>
</sequence_feature>
<sequence_feature type="misc_feature">
<location>2936..5385</location>
</sequence_feature>
<sequence_feature type="misc_feature">
<location>5486..8192</location>
</sequence_feature>
<sequence_feature type="misc_feature">
<location>8293..10488</location>
</sequence_feature>
<sequence_feature type="misc_feature">
<location>10589..12801</location>
</sequence_feature>
<sequence_feature type="misc_feature">
<location>12902..18716</location>
</sequence_feature>
<sequence_feature type="misc_feature">
<location>18817..34793</location>
</sequence_feature>
<sequence_feature type="misc_feature">
<location>34894..51004</location>
</sequence_feature>
<sequence_feature type="misc_feature">
<location>51105..132912</location>
</sequence_feature>
</features>
<base_count num_a="43599" num_c="24512" num_g="23668" num_t="40195" num_others="938" />
<sequence>mhkkiciigagaaglvsakhaikqgyqvdifeqtdqvggtwvysektgchsslykvmktn
lpkeamlfqdepfrdelpsfmshehvleylnefskdfpiqfsstvnevkrendlwkvlie
snsetitrfydvvfvcnghffeplnpyqnsyfkgklihshdyrraehytgknvvivgagp
sgiditlqiaqtanhvtliskkatypvlpesvqqmatnvksvdehgvvtdegdhvpadvi
ivctgyvfkfpfldssliqlkyndrmvsplyehlchvdypttlffiglplgtitfplfev
qvkyalsliagkgklpsddveirnfedarlqgllnpasfhviieeqweymkklakmggfe
ewnymetikklygyimterkknvigykmvnfelttdssdfklltirvdfnddvawiirfa
ypi</sequence>
</entry>
I wish to generate a tree-plot graph using Plotly and igraph libraries by enumerating the list of nodes.
I am using this website here as a reference.
My XML file has elements with a variable number of sub-elements. However, the example given only shows how to build a tree with a fixed number of child nodes (exactly 2 children per node).
Looking at the igraph tutorial website here, I see a similar example that also uses only 2 child nodes per node.
How should I go about generating a tree with a variable number of child nodes per node, as in my XML file?
I've been stuck on this for so long, any help would be greatly appreciated!
Upvotes: 0
Views: 894
Reputation: 1892
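You can create the graph like this:
from lxml import etree
from igraph import Graph

root = etree.parse("entry.xml").getroot()
# Give every element (not every tag name) its own integer id, in document order.
element_ids = {elem: i for i, elem in enumerate(root.iter())}
edges = []
for parent, parent_id in element_ids.items():
    # Iterating an element yields its direct children, however many there are
    # (this replaces the deprecated getchildren()).
    for child in parent:
        edges.append((parent_id, element_ids[child]))
G = Graph(edges)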
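The element_ids dictionary has every element of the XML as a key and a distinct integer id as its value, like {elem1: 0, elem2: 1, elem3: 2}. The keys are the element objects themselves, not the tag strings, so repeated tags such as <class> or <information> each get their own id. That way you can look up the id of any element later.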
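As a minimal sketch of why this handles any number of children per node, the same idea can be tried with only the standard library's xml.etree.ElementTree (swapped in here for lxml) on a shortened excerpt of the XML from the question. The XML string and the labels list are just for illustration:

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the XML in the question (illustration only).
XML = """<entry db="genbank">
  <accession>AC116785</accession>
  <keywords>
    <keyword>HTG</keyword>
    <keyword>HTGS_PHASE2</keyword>
    <keyword>HTGS_DRAFT</keyword>
  </keywords>
  <version>
    <version_number>AC116785.3</version_number>
    <gi>21703640</gi>
  </version>
</entry>"""

root = ET.fromstring(XML)

# One id per element object, assigned in document (pre-order) order.
element_ids = {elem: i for i, elem in enumerate(root.iter())}

# Iterating an element yields its direct children: 3 for <entry>,
# 3 for <keywords>, 2 for <version>, 0 for the leaves.
edges = [
    (parent_id, element_ids[child])
    for parent, parent_id in element_ids.items()
    for child in parent
]

# Tag names in id order, usable as vertex labels.
labels = [elem.tag for elem in element_ids]

print(edges)
print(labels)
```

Nothing in the edge list depends on a fixed branching factor, and a tree with n nodes always yields n - 1 edges, which is a quick sanity check on the result.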
I don't know how to put labels into plotly, but for plotting with igraph it can be useful to add the tag names as labels:
names = [e.tag for e in element_ids]
G.vs['label'] = names
I have not tried it, but once you have the graph, the Plotly visualization should work the same way as in the article.
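For the Plotly side, the remaining step is turning node positions plus the edge list into coordinate lists for a marker trace and a line trace. Below is a rough, untested-against-Plotly sketch of that step in plain Python. The function name plotly_coords and the hand-written positions are made up for illustration; with igraph the positions would presumably come from a tree layout such as G.layout("rt", root=[0]).

```python
def plotly_coords(positions, edges):
    """Build node and edge coordinate lists from positions and an edge list.

    positions: list of (x, y) pairs, one per vertex id.
    edges: list of (parent_id, child_id) pairs.
    """
    max_y = max(y for _, y in positions)
    node_x = [x for x, _ in positions]
    # Flip y so that the vertex with the smallest layout y (the root) is on top.
    node_y = [max_y - y for _, y in positions]

    edge_x, edge_y = [], []
    for a, b in edges:
        # A None entry breaks the line, so each edge is its own segment.
        edge_x += [node_x[a], node_x[b], None]
        edge_y += [node_y[a], node_y[b], None]
    return node_x, node_y, edge_x, edge_y


# Hypothetical positions for a root with three children.
positions = [(0.0, 0.0), (-1.0, 1.0), (0.0, 1.0), (1.0, 1.0)]
edges = [(0, 1), (0, 2), (0, 3)]

node_x, node_y, edge_x, edge_y = plotly_coords(positions, edges)
print(node_y)
print(edge_x)
```

The edge lists would then go into a scatter trace drawn with lines and the node lists into one drawn with markers, with the tag names as hover or text labels.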
Upvotes: 1