Jérémz

Reputation: 393

Python: Extract an XML element's attribute value when a child's attribute meets a criterion

I'm a complete beginner at XML parsing and I'm having trouble extracting specific values when a child element's attribute meets some criteria.

Here is an example of my XML file (from http://www.uniprot.org/uniprot/Q63HN8.xml):

<uniprot xmlns="http://uniprot.org/uniprot" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://uniprot.org/uniprot http://www.uniprot.org/support/docs/uniprot.xsd">
  <entry dataset="Swiss-Prot" created="2007-02-20" modified="2015-09-16" version="112">
    <accession>Q63HN8</accession>
    <accession>C9JCP4</accession>
    <accession>D6RI12</accession>
  <dbReference type="Proteomes" id="UP000005640">
    <property type="component" value="Chromosome 17"/>
  </dbReference>
  <dbReference type="Bgee" id="Q63HN8"/>
  <dbReference type="CleanEx" id="HS_KIAA1618"/>
  <dbReference type="ExpressionAtlas" id="Q63HN8">
    <property type="expression patterns" value="baseline and differential"/>
  </dbReference>
  <dbReference type="GO" id="GO:0005737">
    <property type="term" value="C:cytoplasm"/>
    <property type="evidence" value="ECO:0000314"/>
    <property type="project" value="UniProtKB"/>
  </dbReference>
  <dbReference type="GO" id="GO:0016020">
    <property type="term" value="C:membrane"/>
    <property type="evidence" value="ECO:0000314"/>
    <property type="project" value="UniProtKB"/>
  </dbReference>
  <dbReference type="GO" id="GO:0016887">
    <property type="term" value="F:ATPase activity"/>
    <property type="evidence" value="ECO:0000314"/>
    <property type="project" value="UniProtKB"/>
  </dbReference>
  <dbReference type="GO" id="GO:0016874">
    <property type="term" value="F:ligase activity"/>
    <property type="evidence" value="ECO:0000501"/>
    <property type="project" value="UniProtKB-KW"/>
  </dbReference>

I would like to extract the "id" value of each dbReference whose property child has a "value" attribute starting with "C:". So the expected output is: "GO:0005737" "GO:0016020"

Here is my script so far:

import urllib2
from lxml import etree

file = urllib2.urlopen('http://www.uniprot.org/uniprot/Q63HN8.xml')
tree = etree.parse(file)
root = tree.getroot()
for node in tree.iter('{http://uniprot.org/uniprot}dbReference'):
   if node.attrib.get('type') == 'GO':
        value = node.attrib.get('value');
        print value
        if value.str.startswith('C:'):
            goterm = node.attrib.get('id')
            print goterm

But it is nowhere near working.

EDIT

Also, how can I store the values for different searches in lists? Expected:

goterm_when_C = ['GO:0005737', 'GO:0016020', 'GO:0005730']
goterm_when_F = ['GO:0016887', 'GO:0016874', 'GO:0004842', 'GO:0008270']

When I try:

goterm_when_C = []
goterm_when_F = []
            if value.startswith('C:'):
                go_location = node.attrib.get('id')
                for item in go_location:
                    goterm_when_C.append(item)
            if value.startswith('F:'):
                go_function = node.attrib.get('id')
                for item in go_function:
                    goterm_when_F.append(item)
                break

I get

>>> goterm_when_C
['G', 'O', ':', '0', '0', '0', '5', '7', '3', '7', 'G', 'O', ':', '0', '0', '1', '6', '0', '2', '0', 'G', 'O', ':', '0', '0', '0', '5', '7', '3', '0']

Any help would be greatly appreciated

Upvotes: 1

Views: 3133

Answers (1)

Anand S Kumar

Reputation: 90999

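In your XML, the "value" attribute belongs to the property child elements, not to dbReference itself, so you need to iterate over the children of each dbReference node and check their attributes. Also, a Python string has no .str attribute; call value.startswith('C:') directly. Example: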

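for node in tree.iter('{http://uniprot.org/uniprot}dbReference'):
    if node.attrib.get('type') == 'GO':
        # The 'value' attribute is on the <property> children, not on <dbReference>.
        for child in node:
            value = child.attrib.get('value')
            if value is not None and value.startswith('C:'):
                goterm = node.attrib.get('id')
                print(goterm)
                break  # stop after the first matching property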
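
For the lists asked about in the EDIT, the characters end up split because looping over a string (for item in go_location) yields one character at a time. Appending the id itself avoids that. Below is a minimal, self-contained sketch of that idea. It makes some assumptions: it uses the standard library's xml.etree.ElementTree instead of lxml (the calls used here behave the same way in both), it parses a trimmed copy of the sample from the question rather than downloading the file, and the names SAMPLE and collect_go_ids are made up for illustration.

```python
import xml.etree.ElementTree as ET

NS = '{http://uniprot.org/uniprot}'

# Trimmed copy of the question's sample (assumption: the real file has the same shape).
SAMPLE = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <dbReference type="Bgee" id="Q63HN8"/>
    <dbReference type="GO" id="GO:0005737">
      <property type="term" value="C:cytoplasm"/>
      <property type="evidence" value="ECO:0000314"/>
    </dbReference>
    <dbReference type="GO" id="GO:0016020">
      <property type="term" value="C:membrane"/>
    </dbReference>
    <dbReference type="GO" id="GO:0016887">
      <property type="term" value="F:ATPase activity"/>
    </dbReference>
    <dbReference type="GO" id="GO:0016874">
      <property type="term" value="F:ligase activity"/>
    </dbReference>
  </entry>
</uniprot>"""


def collect_go_ids(root, prefixes=('C:', 'F:')):
    """Hypothetical helper: map each prefix to the GO ids whose property value starts with it."""
    result = {prefix: [] for prefix in prefixes}
    for node in root.iter(NS + 'dbReference'):
        if node.get('type') != 'GO':
            continue
        for child in node:
            value = child.get('value') or ''
            matched = [p for p in prefixes if value.startswith(p)]
            if matched:
                # Append the whole id string; looping over it would add single characters.
                result[matched[0]].append(node.get('id'))
                break  # one id per dbReference is enough
    return result


root = ET.fromstring(SAMPLE)
go_ids = collect_go_ids(root)
goterm_when_C = go_ids['C:']
goterm_when_F = go_ids['F:']
print(goterm_when_C)  # ['GO:0005737', 'GO:0016020']
print(goterm_when_F)  # ['GO:0016887', 'GO:0016874']
```

With the full file you would pass tree.getroot() from your own script instead of parsing SAMPLE. The lists would then also contain the ids that are not part of the excerpt shown in the question.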

Upvotes: 1
