mccurcio
mccurcio

Reputation: 1344

Parsing XML with Python

I have several large .xml files. I want to parse out the files to do several things.

I want to pull out only:

Using Python 2.x which library would be best to import/use. How would I set this up? Any Suggestions?

For Example:

 <PubmedArticle>
    <MedlineCitation Owner="NLM" Status="MEDLINE">
        <PMID Version="1">8981971</PMID>
        <Article PubModel="Print">
            <Journal>
                <ISSN IssnType="Print">0002-9297</ISSN>
                <JournalIssue CitedMedium="Print">
                    <Volume>60</Volume>
                    <Issue>1</Issue>
                    <PubDate>
                        <Year>1997</Year>
                        <Month>Jan</Month>
                    </PubDate>
                </JournalIssue>
                <Title>American journal of human genetics</Title>
                <ISOAbbreviation>Am. J. Hum. Genet.</ISOAbbreviation>
            </Journal>
            <ArticleTitle>mtDNA and Y chromosome-specific polymorphisms in modern Ojibwa: implications about the origin of their gene pool.</ArticleTitle>
            <Pagination>
                <MedlinePgn>241-4</MedlinePgn>
            </Pagination>
            <AuthorList CompleteYN="Y">
                <Author ValidYN="Y">
                    <LastName>Scozzari</LastName>
                    <ForeName>R</ForeName>
                    <Initials>R</Initials>
                </Author>
            </AuthorList>
        <MeshHeadingList>
            <MeshHeading>
                <DescriptorName MajorTopicYN="N">Alleles</DescriptorName>
            </MeshHeading>
            <MeshHeading>
                <DescriptorName MajorTopicYN="Y">Y Chromosome</DescriptorName>
            </MeshHeading>
        </MeshHeadingList>
        <OtherID Source="NLM">PMC1712541</OtherID>
    </MedlineCitation>
</PubmedArticle>

Upvotes: 3

Views: 2872

Answers (5)

Zach Young
Zach Young

Reputation: 11178

I'm not sure why you want each title in its own list, which your question leads me to believe.

How about all the titles in one list? The following example uses a trimmed version of your sample XML, plus I duplicated an <Article/> to show that using lxml.etree.xpath creates the list of <Title/>'s for you:

>>> import lxml.etree

>>> xml_text = """<PubmedArticle>
  <MedlineCitation Owner="NLM" Status="MEDLINE">
    <PMID Version="1">8981971</PMID>
    <Article PubModel="Print">
      <Journal>
        <ISSN IssnType="Print">0002-9297</ISSN>
        <!-- <JournalIssue ... /> -->
        <Title>American journal of human genetics</Title>
        <ISOAbbreviation>Am. J. Hum. Genet.</ISOAbbreviation>
      </Journal>
      <ArticleTitle>mtDNA and Y chromosome-specific polymorphisms in modern Ojibwa: implications about the origin of their gene pool.</ArticleTitle>
      <!--<Pagination>
          ...
          </MeshHeadingList>-->
      <OtherID Source="NLM">PMC1712541</OtherID>
    </Article>
    <Article PubModel="Print">
      <Journal>
        <ISSN IssnType="Print">9297-0002</ISSN>
        <!-- <JournalIssue ... /> -->
        <Title>American Journal of Pediatrics</Title>
        <ISOAbbreviation>Am. J. Ped.</ISOAbbreviation>
      </Journal>
      <ArticleTitle>Healthy Foo, Healthy Bar</ArticleTitle>
      <!--<Pagination>
          ...
          </MeshHeadingList>-->
      <OtherID Source="NLM">PMC1712541</OtherID>
    </Article>
  </MedlineCitation>
</PubmedArticle>"""

XPaths are made to return nodes which lxml.etree.xpath translates into a Python list of node objects:

>>> xml_obj = lxml.etree.fromstring(xml_text)
>>> for title_obj in xml_obj.xpath('//Article/Journal/Title'):
        print title_obj.text 

American journal of human genetics
American Journal of Pediatrics

EDIT 1: Now with Python's xml.etree.ElementTree

I wanted to show this solution with the included module in case installing a third party module is not possible or unattractive.

>>> import xml.etree.ElementTree as ETree
>>> element = ETree.fromstring(xml_text)
>>> xml_obj = ETree.ElementTree(element)
>>> for title_obj in xml_obj.findall('.//Article/Journal/Title'):
    print title_obj.text


American journal of human genetics
American Journal of Pediatrics

It's small, but this XPath is not identical to the XPath in the lxml example: there is a period ('.') at the beginning. Without the period, I got this warning (with Python 2.7.2):

>>> xml_obj.findall('//Article/Journal/Title')

Warning (from warnings module):
  File "__main__", line 1
FutureWarning: This search is broken in 1.3 and earlier, and will be fixed in a future version.  If you rely on the current behaviour, change it to './/Article/Journal/Title'

Upvotes: 2

01100110
01100110

Reputation: 2344

ElementTree is awesome and comes with Python.

Upvotes: -1

varunl
varunl

Reputation: 20229

Try using Beautiful soup. I have found this library very handy. And as just pointed out, BeautifulStoneSoup specifically is for parsing XML.

Upvotes: 2

aweis
aweis

Reputation: 5596

Try look at the lxml module.

To locate the titles you can use Xpath with lxml, or you can use the xml object structure in lxml to "index" you down to the title elements.

Upvotes: 5

RanRag
RanRag

Reputation: 49537

Try lxml with xpath expressions.

A short snippet

>>> from lxml import etree
>>> xml = """<foo><bar/>baz!</foo>"""
>>> doc = etree.fromstring(xml)
>>> doc.xpath('//foo/text()') #xpath expr
['baz!']
>>>

if you have a xml file than

s = StringIO(xml)
doc = etree.parse(s)

You can use Firebug addon to fetch the xpath expr.

Upvotes: 1

Related Questions