Reputation: 1344
I have several large .xml files. I want to parse out the files to do several things.
I want to pull out only:
Using Python 2.x which library would be best to import/use. How would I set this up? Any Suggestions?
For Example:
<PubmedArticle>
<MedlineCitation Owner="NLM" Status="MEDLINE">
<PMID Version="1">8981971</PMID>
<Article PubModel="Print">
<Journal>
<ISSN IssnType="Print">0002-9297</ISSN>
<JournalIssue CitedMedium="Print">
<Volume>60</Volume>
<Issue>1</Issue>
<PubDate>
<Year>1997</Year>
<Month>Jan</Month>
</PubDate>
</JournalIssue>
<Title>American journal of human genetics</Title>
<ISOAbbreviation>Am. J. Hum. Genet.</ISOAbbreviation>
</Journal>
<ArticleTitle>mtDNA and Y chromosome-specific polymorphisms in modern Ojibwa: implications about the origin of their gene pool.</ArticleTitle>
<Pagination>
<MedlinePgn>241-4</MedlinePgn>
</Pagination>
<AuthorList CompleteYN="Y">
<Author ValidYN="Y">
<LastName>Scozzari</LastName>
<ForeName>R</ForeName>
<Initials>R</Initials>
</Author>
</AuthorList>
<MeshHeadingList>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Alleles</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="Y">Y Chromosome</DescriptorName>
</MeshHeading>
</MeshHeadingList>
<OtherID Source="NLM">PMC1712541</OtherID>
</MedlineCitation>
</PubmedArticle>
Upvotes: 3
Views: 2872
Reputation: 11178
I'm not sure why you want each title in its own list, which your question leads me to believe.
How about all the titles in one list? The following example uses a trimmed version of your sample XML, plus I duplicated an <Article/>
to show that using lxml.etree.xpath
creates the list of <Title/>'s
for you:
>>> import lxml.etree
>>> xml_text = """<PubmedArticle>
<MedlineCitation Owner="NLM" Status="MEDLINE">
<PMID Version="1">8981971</PMID>
<Article PubModel="Print">
<Journal>
<ISSN IssnType="Print">0002-9297</ISSN>
<!-- <JournalIssue ... /> -->
<Title>American journal of human genetics</Title>
<ISOAbbreviation>Am. J. Hum. Genet.</ISOAbbreviation>
</Journal>
<ArticleTitle>mtDNA and Y chromosome-specific polymorphisms in modern Ojibwa: implications about the origin of their gene pool.</ArticleTitle>
<!--<Pagination>
...
</MeshHeadingList>-->
<OtherID Source="NLM">PMC1712541</OtherID>
</Article>
<Article PubModel="Print">
<Journal>
<ISSN IssnType="Print">9297-0002</ISSN>
<!-- <JournalIssue ... /> -->
<Title>American Journal of Pediatrics</Title>
<ISOAbbreviation>Am. J. Ped.</ISOAbbreviation>
</Journal>
<ArticleTitle>Healthy Foo, Healthy Bar</ArticleTitle>
<!--<Pagination>
...
</MeshHeadingList>-->
<OtherID Source="NLM">PMC1712541</OtherID>
</Article>
</MedlineCitation>
</PubmedArticle>"""
XPaths are made to return nodes which lxml.etree.xpath
translates into a Python list of node objects:
>>> xml_obj = lxml.etree.fromstring(xml_text)
>>> for title_obj in xml_obj.xpath('//Article/Journal/Title'):
print title_obj.text
American journal of human genetics
American Journal of Pediatrics
EDIT 1: Now with Python's xml.etree.ElementTree
I wanted to show this solution with the included module in case installing a third party module is not possible or unattractive.
>>> import xml.etree.ElementTree as ETree
>>> element = ETree.fromstring(xml_text)
>>> xml_obj = ETree.ElementTree(element)
>>> for title_obj in xml_obj.findall('.//Article/Journal/Title'):
print title_obj.text
American journal of human genetics
American Journal of Pediatrics
It's small, but this XPath is not identical to the XPath in the lxml
example: there is a period ('.') at the beginning. Without the period, I got this warning (with Python 2.7.2):
>>> xml_obj.findall('//Article/Journal/Title')
Warning (from warnings module):
File "__main__", line 1
FutureWarning: This search is broken in 1.3 and earlier, and will be fixed in a future version. If you rely on the current behaviour, change it to './/Article/Journal/Title'
Upvotes: 2
Reputation: 20229
Try using Beautiful soup. I have found this library very handy. And as just pointed out, BeautifulStoneSoup specifically is for parsing XML.
Upvotes: 2
Reputation: 5596
Try look at the lxml module.
To locate the titles you can use Xpath with lxml, or you can use the xml object structure in lxml to "index" you down to the title elements.
Upvotes: 5
Reputation: 49537
Try lxml with xpath expressions.
A short snippet
>>> from lxml import etree
>>> xml = """<foo><bar/>baz!</foo>"""
>>> doc = etree.fromstring(xml)
>>> doc.xpath('//foo/text()') #xpath expr
['baz!']
>>>
if you have a xml file
than
s = StringIO(xml)
doc = etree.parse(s)
You can use Firebug addon
to fetch the xpath expr
.
Upvotes: 1