justin viola

Reputation: 49

XML with multiple DOCTYPE declarations

Hello, I have a rather large XML file (10-15 GB). It contains multiple XML declarations and DOCTYPE declarations, each followed by its own root element; my guess is that whoever made it just concatenated a bunch of separate files together. This is definitely not best practice, but sometimes it is all you have to work with. I am wondering if anyone has a solution for parsing the file, or for splitting it into one file per DOCTYPE.

So far I have tried wrapping the entire file in a single root tag, but this did not work. I am working in Python.

Any solution or input would be appreciated.


<?xml version="1.0" ?>
<!DOCTYPE pmc-articleset PUBLIC "-//NLM//DTD ARTICLE SET 2.0//EN" "https://dtd.nlm.nih.gov/ncbi/pmc/articleset/nlm-articleset-2.0.dtd">

<pmc-articleset><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
  <?properties open_access?>
  <front>
    <p>
    Apple
    </p>
  </front>
</article>
</pmc-articleset>
<?xml version="1.0" ?>
<!DOCTYPE pmc-articleset PUBLIC "-//NLM//DTD ARTICLE SET 2.0//EN" "https://dtd.nlm.nih.gov/ncbi/pmc/articleset/nlm-articleset-2.0.dtd">
<pmc-articleset><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
  <?properties open_access?>
  <front>
    <p>
    Banana
    </p>
  </front>
</article>
</pmc-articleset>

  


Upvotes: 0

Views: 1399

Answers (2)

Tomalak

Reputation: 338416

Splitting a file into multiple parts can be done with csplit(1), which splits a file at every line matching a pattern. The commands below use GNU csplit options.

Either at the XML declaration <?xml ...

csplit -z --prefix output_file --suffix-format '%02d.xml' your_large.xml '/^<[?]xml[ ]/' {*}

or, if the XML declaration does not repeat, at the <!DOCTYPE

csplit -z --prefix output_file --suffix-format '%02d.xml' your_large.xml '/<!DOCTYPE/' {*}

which will result in output_file00.xml, output_file01.xml, etc.
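
Since the question mentions Python, the same split can be sketched there. This is only a rough sketch under stated assumptions: every document starts with an XML declaration at the beginning of a line (the same assumption as the first csplit pattern), and no single line is too large to hold in memory. The function name split_concatenated_xml and the output file naming are made up for illustration.

```python
import os


def split_concatenated_xml(src_path, out_dir, prefix="output_file"):
    """Stream src_path line by line and start a new output file at every
    line that begins with an XML declaration. Returns the paths written.

    Assumption: each concatenated document begins with '<?xml ' at the
    start of a line, as in the sample in the question.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    out = None
    try:
        with open(src_path, "rb") as src:
            for line in src:
                starts_document = line.startswith(b"<?xml ")
                if out is None and not starts_document:
                    # Skip blank lines before the first declaration;
                    # anything else is kept in an initial chunk.
                    if not line.strip():
                        continue
                    starts_document = True
                if starts_document:
                    if out is not None:
                        out.close()
                    path = os.path.join(
                        out_dir, f"{prefix}{len(paths):05d}.xml"
                    )
                    paths.append(path)
                    out = open(path, "wb")
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return paths
```

Only one line is held in memory at a time, so the size of the input should not matter much. Each resulting file can then be handed to an ordinary XML parser on its own.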

Upvotes: 2

imhotap

Reputation: 2535

If your input document's prolog actually contains multiple document type declarations (multiple DOCTYPEs), or doesn't appear to have a document element, then it might very well be full SGML rather than XML. Your example, though, shows neither.

Upvotes: 1
