Reputation: 49
Hello, I have a rather large XML file (10-15 GB). It contains multiple root DOCTYPE declarations; my guess is that whoever made it just concatenated a bunch of separate files together. This is definitely not best practice, but sometimes it is all you have to work with. I am wondering if anyone has a solution for parsing the file, or for splitting it into one file per DOCTYPE.
So far I have tried wrapping the entire file in one single root tag but this did not work. I am working in Python.
Any solution or input would be appreciated.
<?xml version="1.0" ?>
<!DOCTYPE pmc-articleset PUBLIC "-//NLM//DTD ARTICLE SET 2.0//EN" "https://dtd.nlm.nih.gov/ncbi/pmc/articleset/nlm-articleset-2.0.dtd">
<pmc-articleset><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<?properties open_access?>
<front>
<p>
Apple
</p>
</front>
</article>
</pmc-articleset>
<?xml version="1.0" ?>
<!DOCTYPE pmc-articleset PUBLIC "-//NLM//DTD ARTICLE SET 2.0//EN" "https://dtd.nlm.nih.gov/ncbi/pmc/articleset/nlm-articleset-2.0.dtd">
<pmc-articleset><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<?properties open_access?>
<front>
<p>
Banana
</p>
</front>
</article>
</pmc-articleset>
Upvotes: 0
Views: 1399
Reputation: 338416
Splitting a file into multiple parts could be done with csplit(1), which is the utility for the task.
Either at the XML declaration <?xml ...
csplit -z --prefix output_file --suffix-format '%02d.xml' your_large.xml '/^<[?]xml[ ]/' {*}
or, if that does not repeat, at the <!DOCTYPE
csplit -z --prefix output_file --suffix-format '%02d.xml' your_large.xml '/<!DOCTYPE/' {*}
which will result in output_file00.xml, output_file01.xml, etc.
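Since the question mentions Python, the same split can be sketched there as well, which may help where GNU csplit is not available. This is only a minimal sketch, not a tested tool: it assumes every document starts with an XML declaration at the beginning of a line, as in the sample above, and the names split_file and XML_DECL are made up for illustration. It copies line by line, so the whole file is never held in memory.

```python
# Assumption: each concatenated document begins with "<?xml " at the start of a line.
XML_DECL = b"<?xml "


def split_file(path, prefix="output_file"):
    """Write each concatenated document to <prefix>NN.xml; return the number of files."""
    count = 0
    dst = None
    try:
        with open(path, "rb") as src:
            for line in src:  # one line at a time, never the whole file
                # Start a new output file at the first line and at every declaration.
                if dst is None or line.startswith(XML_DECL):
                    if dst is not None:
                        dst.close()
                    dst = open(f"{prefix}{count:02d}.xml", "wb")
                    count += 1
                dst.write(line)
    finally:
        if dst is not None:
            dst.close()
    return count
```

Calling split_file("your_large.xml") should produce the same output_file00.xml, output_file01.xml, ... naming as the csplit commands. Each piece is then a well-formed document on its own and could be read with, for example, xml.etree.ElementTree.iterparse.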
Upvotes: 2
Reputation: 2535
If your input document's prolog actually contains multiple document type declarations (multiple DOCTYPEs), or doesn't appear to have a document element, then it might very well be full SGML rather than XML, though your example shows neither.
Upvotes: 1