AutoTester999

Reputation: 616

XSLT 1.0 (xsltproc) - Unable to Parse Huge XML

I am trying to parse an input XML file that is 1,300,000 lines long (56 MB) using xsltproc. I get the following error:

input.xml:245393: parser error : internal error: Huge input lookup
              "description" : "List of values for possible department codes"
                          ^
unable to parse input.xml

The same xsltproc was able to process an XML file that was 930,000 lines long (48 MB).

I also tried cutting the failing file down to 600,000 lines by removing unnecessary parts, but I still get the same error. This is strange, because xsltproc can parse the 930,000-line file but not the 600,000-line one.

How do I resolve this issue?

Upvotes: 5

Views: 1390

Answers (3)

nwellnhof

Reputation: 33618

libxslt 1.1.35 added a --huge option to xsltproc which disables some internal limits like XML_MAX_LOOKUP_LIMIT.
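
Since the option only exists from libxslt 1.1.35 onwards, a wrapper script may want to check the installed version before passing it. The following is a minimal sketch, not a definitive implementation: the helper names are hypothetical, and it assumes that `xsltproc --version` reports libxslt as a single integer (for example `libxslt 10135` for 1.1.35), which is how the builds I have seen print it.

```python
import re
import subprocess

# Assumption: libxslt prints its version as one integer, so 1.1.35 -> 10135.
MIN_HUGE_VERSION = 10135


def libxslt_version(version_output):
    """Return the libxslt version number found in `xsltproc --version` output, or None."""
    match = re.search(r"libxslt (\d+)", version_output)
    return int(match.group(1)) if match else None


def build_command(stylesheet, xml_file, version_output):
    """Build an xsltproc argument list, adding --huge only when it is supported."""
    command = ["xsltproc"]
    version = libxslt_version(version_output)
    if version is not None and version >= MIN_HUGE_VERSION:
        command.append("--huge")
    return command + [stylesheet, xml_file]


def run_transform(stylesheet, xml_file):
    """Query the installed xsltproc, then run the transformation and return its stdout."""
    version_output = subprocess.run(
        ["xsltproc", "--version"], capture_output=True, text=True, check=True
    ).stdout
    command = build_command(stylesheet, xml_file, version_output)
    return subprocess.run(command, capture_output=True, check=True).stdout
```

Calling `run_transform("style.xsl", "input.xml")` (placeholder file names) would then fall back to a plain xsltproc call on older installations, where the "Huge input lookup" error can still occur.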

Upvotes: 1

Adrian W

Reputation: 5026

Write your own xsltproc in Python based on this snippet:

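import argparse
import sys

from lxml import etree

parser = argparse.ArgumentParser()
# Open the files in binary mode so the parser detects the encoding from the XML declaration.
parser.add_argument('stylesheet', help='XSLT style sheet', type=argparse.FileType('rb'))
parser.add_argument('input', help='XML input file(s)', nargs='*', type=argparse.FileType('rb'))
# Default to stdout so that write_output() always receives a file object.
parser.add_argument('--output', help='The output file to create (default: stdout).',
                    type=argparse.FileType('wb'), default=sys.stdout.buffer)

args = parser.parse_args()

transform = etree.XSLT(etree.parse(args.stylesheet))

# huge_tree=True lifts libxml2's built-in size limits for the input documents.
xml_parser = etree.XMLParser(huge_tree=True)

for xml in args.input:
    # write_output() serialises according to the stylesheet's <xsl:output> settings.
    transform(etree.parse(xml, xml_parser)).write_output(args.output)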

This uses lxml as suggested in this answer.

The huge_tree=True argument sets the corresponding parser option in libxml2 and thus enables it to process large files. See Parser options for more information.

Upvotes: 3

AutoTester999

Reputation: 616

Using Oxygen XML Editor (Xalan) resolved my issue.

Upvotes: 0