panofish

Reputation: 7899

Python ElementTree XML Parsing

I am trying to parse an XML file that I got by exporting a PDF to XML 1.0 using Adobe Acrobat Pro. I am using Python and ElementTree to parse it. The PDF contains a table that spans multiple pages and has several different table headers.

I want to extract the row and column data from the table that begins with the header containing a particular string (e.g. "MECHANICAL") and stop at the next table heading section (e.g. "COMPLETED"), thereby excluding all row and column data before and after this section. There is no distinctive tag to key on; the same tag pattern just repeats.

Here is my current Python code:

# Python

import sys
import re     # regular expression
import xml.etree.ElementTree as xml

tree = xml.parse("C:/Documents and Settings/alilly.CORPORATE/Desktop/python xml parse/excerpt.xml")

print "=================== Find Columns ===================="    

for node in tree.iter('TR'):

    print "tag=",node.tag

    # iter() replaces the deprecated getiterator(); it returns a generator, hence list()
    count = len(list(node.iter('TD')))

    #if count != 10:
    #    continue

    print "------------"

    for col in node.iter('TD'):
        print "      tag=",col.tag, "attrib=", col.attrib, "text=", col.text


print "=================== Find Headers ===================="

# find headers
for node in tree.iter('ImageData'):
    print "figure text = ", node.tail

And here is my XML file:

<?xml version="1.0" encoding="UTF-8" ?>
<!-- Created from PDF via Acrobat SaveAsXML -->
<!-- Mapping Table version: 28-February-2003 -->
<TaggedPDF-doc>
<?xpacket begin='?' id='W5M0MpCehiHzreSzNTczkc9d'?>
<?xpacket begin="?" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 4.2.1-c041 52.342996, 2008/05/07-20:48:00        ">
   <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description rdf:about=""
            xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
         <pdf:Producer>GPL Ghostscript 8.70</pdf:Producer>
         <pdf:Keywords/>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:xmp="http://ns.adobe.com/xap/1.0/">
         <xmp:ModifyDate>2011-03-01T09:36:13-05:00</xmp:ModifyDate>
         <xmp:CreateDate>2011-03-01T09:36:13-05:00</xmp:CreateDate>
         <xmp:CreatorTool>PDFCreator Version 1.0.2</xmp:CreatorTool>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/">
         <xmpMM:DocumentID>d417764e-466c-11e0-0000-f7ea6a538d79</xmpMM:DocumentID>
         <xmpMM:InstanceID>uuid:0c6ada50-6db0-4d59-88e1-fc23aa6ebc14</xmpMM:InstanceID>
      </rdf:Description>
      <rdf:Description rdf:about=""
            xmlns:dc="http://purl.org/dc/elements/1.1/">
         <dc:format>xml</dc:format>
         <dc:title>
            <rdf:Alt>
               <rdf:li xml:lang="x-default">my pdf file</rdf:li>
            </rdf:Alt>
         </dc:title>
         <dc:creator>
            <rdf:Seq>
               <rdf:li>ltamm</rdf:li>
            </rdf:Seq>
         </dc:creator>
         <dc:description>
            <rdf:Alt>
               <rdf:li xml:lang="x-default"/>
               <rdf:li xml:lang="x-repair"/>
            </rdf:Alt>
         </dc:description>
      </rdf:Description>
   </rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>
<?xpacket end='r'?>
<Part>
<H1>Misc </H1>
<Sect>
<H3>This is a test </H3>
<Sect>
<H5>Deletions </H5>
<L>
<LI>
<LI_Title>Special codes </LI_Title>
</LI>
</L>
<Figure>
<ImageData src=""/>
</Figure>
<Figure>
<ImageData src=""/>
Main INTERIOR </Figure>
<Table>
<TR>
<TH>S = Standard O = Optional </TH>
</TR>
<TR>
<TD><Figure>
<ImageData src=""/>
</Figure>
</TD>
<TD>S </TD>
</TR>
</Table>
<Figure>
<ImageData src=""/>
This is the MECHANICAL header</Figure>
<Table>
<TR>
<TH>S = Standard O = Optional </TH>
</TR>
<TR>
<TH>Free Flow </TH>
<TD>Ref. Code </TD>
<TD>DESCRIPTION </TD>
<TD>Rooster </TD>
<TD>747 Dog </TD>
<TD>888 Rabbit </TD>
</TR>
<TR>
<TD>xxx GOgo xxB </TD>
<TD>Beany xxx </TD>
<TD>nothing here xxx </TD>
<TD>xxx B </TD>
<TD>snake ddd </TD>
<TD>Cow fff </TD>
<TD>eee </TD>
</TR>
<TR>
<TH/>
<TD/>
<TD>Squirrel Protection </TD>
<TD>S </TD>
<TD>S </TD>
<TD>S </TD>
<TD>S </TD>
<TD>S </TD>
<TD>S </TD>
<TD>S </TD>
</TR>
<TR>
<TH/>
<TD>J77 </TD>
<TD>Rocket Launcher </TD>
<TD>S </TD>
<TD>S </TD>
<TD>S </TD>
<TD>S </TD>
<TD>S </TD>
<TD>S </TD>
<TD>S </TD>
</TR>
<TR>
<TH/>
<TD/>
<TD>Lunch </TD>
<TD>S </TD>
<TD>S </TD>
<TD>S </TD>
<TD>S </TD>
<TD>S </TD>
<TD>S </TD>
<TD>S </TD>
</TR>
<TR>
<TH/>
<TD>Jss5 </TD>
<TD>Now is the time for all good men </TD>
<TD>-</TD>
<TD>A1 </TD>
<TD>A1 </TD>
<TD>-</TD>
<TD>-</TD>
<TD>-</TD>
<TD>-</TD>
</TR>
<TR>
<TD>Capacity </TD>
<TD/>
<TD>2/3 </TD>
<TD>2/3 </TD>
<TD>2/3 </TD>
</TR>
</Table>
<Figure>
<ImageData src=""/>
Final COMPLETED PAGE 1 OF 2 </Figure>
<Figure>
<ImageData src=""/>
</Figure>
<P>Graphite </P>
<P>painted fun </P>
<P>Control yourself </P>
<Figure>
<ImageData src=""/>
Meaningless Header PAGE 2 OF 2 </Figure>
<Figure>
<ImageData src=""/>
</Figure>
<P>)multi-coat </P>
<P>front</P>
<P>single-slot system </P>
<Figure>
<ImageData src=""/>
Almost Done Header PAGE 1 OF 1 </Figure>
<Figure>
<ImageData src=""/>
</Figure>
<Figure>
<ImageData src=""/>
</Figure>
<Figure>
<ImageData src=""/>
</Figure>
<P>Snow Blizzard. </P>
<P>Done </P>
</Sect>
</Sect>
</Part>
</TaggedPDF-doc>

Upvotes: 2

Views: 4476

Answers (2)

Steven D. Majewski

Reputation: 2167

What exactly you're trying to select is not clear from your description. It sounds like you want to process all elements between the elements containing the strings "MECHANICAL" and "COMPLETED". (In this example, that's just a single Table, but I assume it could be an arbitrary number of Tables.)

If you can use lxml, you can do the selection with XPath.

from lxml import etree
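x = etree.parse('mech.xml')  # parse() accepts a filename directly; file() is Python 2 only
# select the first Table following "MECHANICAL" (drop the [1] to get every following Table):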
fol = x.xpath( '//Figure[contains(., "MECHANICAL")]/following-sibling::Table[1]' )
# [<Element Table at 101532ec0>]
# select Tables preceding "COMPLETED" :
pre = x.xpath( '//Figure[contains(.,"COMPLETED")]/preceding-sibling::Table' )
# [<Element Table at 101532d08>, <Element Table at 101532ec0>]
# get their intersection:
tables = [ e for e in fol if e in pre ]
for t in tables:
    for tr in t.xpath('TR'):
        # [ ... process ... ]
        pass  # placeholder so the loop body is syntactically valid
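
For readers who cannot install lxml, roughly the same selection can be sketched with the standard-library ElementTree, whose XPath subset has no sibling axes. The sketch below walks each parent's children in document order and toggles a flag at the marker Figures. It is written for Python 3; the function name rows_between and the miniature SAMPLE document are made up for illustration and are not part of the original file.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the exported file, for illustration only.
SAMPLE = """<Sect>
<Figure><ImageData src=""/>Main INTERIOR </Figure>
<Table><TR><TD>skip me </TD></TR></Table>
<Figure><ImageData src=""/>This is the MECHANICAL header</Figure>
<Table>
<TR><TH>Free Flow </TH><TD>Ref. Code </TD></TR>
<TR><TH/><TD>J77 </TD><TD>Rocket Launcher </TD></TR>
</Table>
<Figure><ImageData src=""/>Final COMPLETED PAGE 1 OF 2 </Figure>
<Table><TR><TD>skip me too </TD></TR></Table>
</Sect>"""


def rows_between(root, start="MECHANICAL", stop="COMPLETED"):
    """Return the cell texts of every TR in Tables between the two marker Figures."""
    rows = []
    # Visit every element as a potential parent and scan its children in order.
    for parent in root.iter():
        collecting = False
        for child in parent:
            if child.tag == "Figure":
                # itertext() includes the tail text that follows <ImageData/>.
                text = "".join(child.itertext())
                if start in text:
                    collecting = True
                elif stop in text:
                    collecting = False
            elif child.tag == "Table" and collecting:
                for tr in child.iter("TR"):
                    rows.append([(cell.text or "").strip()
                                 for cell in tr
                                 if cell.tag in ("TH", "TD")])
    return rows


rows = rows_between(ET.fromstring(SAMPLE))
for row in rows:
    print(row)
```

Unlike the XPath version, this assumes the marker Figures and the Tables are siblings under the same parent, as they are in the posted file; for the real export, ET.parse(filename).getroot() would replace ET.fromstring(SAMPLE).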

Upvotes: 0

PeterBorocz

Reputation: 63

In cases where I need to keep state, I fall back to a SAX-style XML parser. Here's a sample script that simply pulls the rows between your MECHANICAL and COMPLETED figures.

#!python
import xml.sax
import xml.sax.handler

class Handler(xml.sax.handler.ContentHandler):
    def __init__(self):
        self.l_ch = list()
        self.__in_mechanical = False

    def startElement(self, name, attrs):
        if name == 'TR':
            self.l_rows = list()

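    def characters(self, ch):
        # collect each chunk of character data; chunks are joined in endElement
        self.l_ch.append(ch)

    def endElement(self, name):
        # always bind ch, even if no character data was seen for this element
        ch = ''.join(self.l_ch).strip()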

        if name == 'Figure':
            if ch.find('MECHANICAL') >= 0:
                self.__in_mechanical = True
            elif ch.find('COMPLETED') >= 0:
                self.__in_mechanical = False

        elif name == 'TD' and self.__in_mechanical:
            self.l_rows.append(ch)

        elif name == 'TR' and self.__in_mechanical:
            print 'Row:', self.l_rows
            self.l_rows = list()

        self.l_ch = list()

parser = xml.sax.make_parser()
parser.setContentHandler(Handler())
parser.parse(open('sample.xml'))

This gives me the following results, and should be a starting point for more complex cases.

Row: []
Row: [u'Ref. Code', u'DESCRIPTION', u'Rooster', u'747 Dog', u'888 Rabbit']
Row: [u'xxx GOgo xxB', u'Beany xxx', u'nothing here xxx', u'xxx B', u'snake ddd', u'Cow fff', u'eee']
Row: [u'', u'Squirrel Protection', u'S', u'S', u'S', u'S', u'S', u'S', u'S']
Row: [u'J77', u'Rocket Launcher', u'S', u'S', u'S', u'S', u'S', u'S', u'S']
Row: [u'', u'Lunch', u'S', u'S', u'S', u'S', u'S', u'S', u'S']
Row: [u'Jss5', u'Now is the time for all good men', u'-', u'A1', u'A1', u'-', u'-', u'-', u'-']
Row: [u'Capacity', u'', u'2/3', u'2/3', u'2/3']
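
The same state-machine idea can be sketched for Python 3 as a handler that collects rows into a list instead of printing them. This is a variation on the script above and not a drop-in copy: it also records TH cells (the script above only records TD), and it clears the text buffer at every start and end tag. The class name RowCollector and the miniature SAMPLE document are made up for illustration.

```python
import xml.sax
import xml.sax.handler

# Hypothetical miniature of the exported file, for illustration only.
SAMPLE = """<Sect>
<Figure><ImageData src=""/>Main INTERIOR </Figure>
<Table><TR><TD>skip me </TD></TR></Table>
<Figure><ImageData src=""/>This is the MECHANICAL header</Figure>
<Table>
<TR><TH>Free Flow </TH><TD>Ref. Code </TD></TR>
<TR><TH/><TD>J77 </TD><TD>Rocket Launcher </TD></TR>
</Table>
<Figure><ImageData src=""/>Final COMPLETED PAGE 1 OF 2 </Figure>
<Table><TR><TD>skip me too </TD></TR></Table>
</Sect>"""


class RowCollector(xml.sax.handler.ContentHandler):
    """Collect TH/TD texts for rows between the start and stop Figures."""

    def __init__(self, start="MECHANICAL", stop="COMPLETED"):
        super().__init__()
        self.start = start
        self.stop = stop
        self.rows = []            # finished rows, one list of cell texts each
        self._row = None          # cells of the TR currently being read
        self._text = []           # character data seen since the last tag
        self._collecting = False  # True while between the two markers

    def startElement(self, name, attrs):
        self._text = []
        if name == "TR":
            self._row = []

    def characters(self, content):
        self._text.append(content)

    def endElement(self, name):
        text = "".join(self._text).strip()
        self._text = []
        if name == "Figure":
            if self.start in text:
                self._collecting = True
            elif self.stop in text:
                self._collecting = False
        elif name in ("TH", "TD"):
            if self._collecting and self._row is not None:
                self._row.append(text)
        elif name == "TR":
            if self._collecting and self._row is not None:
                self.rows.append(self._row)
            self._row = None


handler = RowCollector()
xml.sax.parseString(SAMPLE.encode("utf-8"), handler)
for row in handler.rows:
    print("Row:", row)
```

For the real file, xml.sax.parse(filename, handler) would replace the parseString call. Returning the rows rather than printing them makes it easier to filter by column count afterwards.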

Upvotes: 4
