Jas

Reputation: 11

Parsing large XML file with lxml

I am trying to parse the dblp.xml file (3.2 GB) using lxml. My code is below.

from lxml import etree
from io import StringIO, BytesIO
tree = etree.parse("dblp.xml")

However, I get the following error:

OSError                                   Traceback (most recent call last)
<ipython-input-5-6a342013a160> in <module>
      1 from lxml import etree
      2 from io import StringIO, BytesIO
----> 3 tree = etree.parse("dblp.xml")

src/lxml/etree.pyx in lxml.etree.parse()

src/lxml/parser.pxi in lxml.etree._parseDocument()

src/lxml/parser.pxi in lxml.etree._parseDocumentFromURL()

src/lxml/parser.pxi in lxml.etree._parseDocFromFile()

src/lxml/parser.pxi in lxml.etree._BaseParser._parseDocFromFile()

src/lxml/parser.pxi in lxml.etree._ParserContext._handleParseResultDoc()

src/lxml/parser.pxi in lxml.etree._handleParseResult()

src/lxml/parser.pxi in lxml.etree._raiseParseError()

OSError: Error reading file 'dblp.xml': failed to load external entity "dblp.xml"

Both dblp.xml and dblp.dtd are already in the root folder.

Please help!

Upvotes: 0

Views: 2277

Answers (2)

Maciej Wrobel

Reputation: 660

As Jan Jaap Meijerink suggested, you can try iterparse. You may also be able to disable the lxml safety limits that prevent parsing very large files by passing huge_tree=True (see the documentation at https://lxml.de/api/lxml.etree.XMLParser-class.html):

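from lxml import etree

# Binary mode lets the parser handle the encoding declared in the file itself.
with open('', 'rb') as fobj:
    for event, elem in etree.iterparse(fobj, huge_tree=True):
        # do something with element or event
        pass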

Alternatively, if you prefer to use parse, you can define an XML parser with huge_tree enabled and set it as the default for later calls to etree.parse:

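from lxml import etree

xml_parser_settings = dict(
    huge_tree=True,  # resolve_entities=False, remove_pis=True, no_network=True
)

# Unpack the dict into keyword arguments; XMLParser does not accept a settings dict.
XMLPARSER = etree.XMLParser(**xml_parser_settings)
etree.set_default_parser(XMLPARSER)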

After those statements, etree.parse will use the configured XMLPARSER. Beware of multithreading, though: the default parser is set only for the current thread (https://lxml.de/1.3/api/lxml.etree-module.html#set_default_parser).

Adding the resolve_entities, remove_pis and no_network keywords may (at least slightly) reduce the risk of pulling in huge external files when the XML comes from an untrusted source.
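If changing the default parser is a concern, the parser can also be passed straight to etree.parse. The following is only a sketch: the in-memory document stands in for the real file, and whether these option values suit dblp.xml is an assumption (for example, resolve_entities=False leaves entity references unexpanded).

```python
import io
from lxml import etree

# Options mirror the keywords named above; the chosen values are assumptions.
parser = etree.XMLParser(
    huge_tree=True,          # lift libxml2's safety limits for very large documents
    resolve_entities=False,  # do not expand entity references
    no_network=True,         # never fetch DTDs or entities over the network
    remove_pis=True,         # drop processing instructions
)

# Hypothetical stand-in for the real file, small enough to run anywhere.
sample = io.BytesIO(
    b"<?note x?><dblp><article key='a1'><title>T</title></article></dblp>"
)

# Passing the parser explicitly leaves the default parser untouched.
tree = etree.parse(sample, parser)
root = tree.getroot()
print(root.tag, len(root))
```

With a real file, the first argument would be the file path or an open binary file object instead of the BytesIO stand-in.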

Upvotes: 1

Jan Jaap Meijerink

Reputation: 427

You can use etree.iterparse to avoid loading the whole file into memory:

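from lxml import etree

events = ("start", "end")
# Binary mode lets the parser handle the encoding declared in the file itself.
with open("dblp.xml", "rb") as fo:
    context = etree.iterparse(fo, events=events)
    for action, elem in context:
        # Do something with elem, e.g. inspect elem.tag
        pass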

This lets you extract only the elements you need and ignore the rest.
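Note that iterparse still builds the tree as it goes, so memory stays low only if processed elements are released. A minimal sketch of that pattern follows; the helper name iter_titles, the article/title element names, and the in-memory sample are assumptions made for illustration, not something taken from the real dblp.xml.

```python
import io
from lxml import etree

def iter_titles(source, tag="article"):
    """Yield the <title> text of each `tag` element, releasing memory as we go."""
    # events=("end",) fires once an element is fully parsed;
    # tag=... makes lxml report only elements with that name.
    for _event, elem in etree.iterparse(source, events=("end",), tag=tag):
        yield elem.findtext("title")
        # Drop the element's own content, then delete siblings that were
        # already parsed so the partially built tree does not keep growing.
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]

# Hypothetical stand-in for the real file.
sample = io.BytesIO(
    b"<dblp>"
    b"<article><title>First</title></article>"
    b"<book><title>Skipped</title></book>"
    b"<article><title>Second</title></article>"
    b"</dblp>"
)
titles = list(iter_titles(sample))
print(titles)
```

For the real file, source would be the path or a file opened in binary mode. Whether extra options such as load_dtd=True are needed to resolve the entities declared in dblp.dtd depends on the file and is not covered by this sketch.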

Upvotes: 2
