Tony Nesavich

Reputation: 23

Extract entire elements from large XML to individual files

I want to extract elements from a large XML file into individual files, preferably with a command or script.

The issue is that the XML is proprietary and not well-formed, so whenever I try XML utilities like twig or xmlstarlet the data gets mangled and special characters get corrupted. Hence I need a simple regex match that copies exactly what matches into a file, once per match, with file names that increment: match1.xml, match2.xml, and so on.

Example XML source:

...
  <testcase id="001" kind="bvt">
    <inputs>
      <arg1>4</arg1>
      <arg2>7</arg2>
    </inputs>
    <expected>11.00</expected>
  </testcase>
  <testcase id="002" kind="drt">
    <inputs>
      <arg1>9</arg1>
      <arg2>6</arg2>
    </inputs>
    <expected>15.00</expected>
  </testcase>
  <testcase id="003" kind="bvt">
    <inputs>
      <arg1>5</arg1>
      <arg2>8</arg2>
    </inputs>
    <expected>13.00</expected>
  </testcase>
...

Desired output:

Content of match1.xml:

...
  <testcase id="001" kind="bvt">
    <inputs>
      <arg1>4</arg1>
      <arg2>7</arg2>
    </inputs>
    <expected>11.00</expected>
  </testcase>
...

Content of match2.xml:

...
  <testcase id="002" kind="drt">
    <inputs>
      <arg1>9</arg1>
      <arg2>6</arg2>
    </inputs>
    <expected>15.00</expected>
  </testcase>
...

and so on.

Here is a regex I put together that works. All I need is help writing a loop in a bash script that copies each matched element to its own file.

(<testcase*[\s\S]*?<\/testcase>)
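One detail of this pattern is worth checking on a small sample before scripting the loop: in `<testcase*` the `*` applies only to the final `e`, so the pattern also accepts a tag such as `<testcases>`. The sketch below uses Python's `re` module (an assumption, since the question asks for bash) and a made-up sample whose `<testcases>` wrapper is hypothetical. It compares the original pattern with a tightened variant that uses `\b`.

```python
import re

# Hypothetical sample, shortened from the example source; the <testcases>
# wrapper is an assumption, since the real root element is not shown.
SAMPLE = """<testcases>
  <testcase id="001" kind="bvt">
    <expected>11.00</expected>
  </testcase>
  <testcase id="002" kind="drt">
    <expected>15.00</expected>
  </testcase>
</testcases>
"""

# Pattern from the question: "e*" means "zero or more e", so the opening
# <testcases> tag is accepted as the start of a match.
original = re.compile(r'(<testcase*[\s\S]*?<\/testcase>)')

# Tighter variant: \b requires a word boundary right after "testcase".
tightened = re.compile(r'<testcase\b[\s\S]*?</testcase>')

loose_matches = original.findall(SAMPLE)
tight_matches = tightened.findall(SAMPLE)

for label, found in (("original", loose_matches), ("tightened", tight_matches)):
    print(label, "first match starts with:", found[0].splitlines()[0])
```

With this sample, the first match of the original pattern begins at the wrapper tag, while the tightened pattern begins at the first `<testcase ...>` element.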

Upvotes: 0

Views: 521

Answers (3)

Tony Nesavich

Reputation: 23

Figured it out! Python's built-in regex module, re, solved this.

Below is the Python I used. In this case the element is everything (including line breaks, carriage returns, line feeds, special characters, etc.) from the opening tag up to and including the closing tag, as needed for this use case.

Each object element is written to its own numbered file, package-0000 through package-nnnn, and the content is exactly what was in the original file (no munging issues)! :)

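import re

# Lazily match one <object>...</object> element; [\s\S] also matches newlines.
pattern = re.compile(r'(<object>[\s\S]*?</object>)')

# newline='' turns off newline translation, so line endings are read as-is.
with open("/temp/Test/package1.xml", 'r', newline='') as f:
    matches = pattern.findall(f.read())

# Write each match to package-0000.xml, package-0001.xml, ...
for i, match in enumerate(matches):
    with open("/temp/Test/package-{0:04d}.xml".format(i), 'w', newline='') as nf:
        nf.write(match)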
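For a very large input, or when every byte must survive untouched, the same idea can run on raw bytes instead of decoded text. The following is a minimal sketch, not a tested drop-in: `split_elements` and its parameters are hypothetical names, and the `\b` after the tag name is an addition that also allows attributes on the opening tag. It memory-maps the file so the whole document does not have to be read into a Python string first.

```python
import mmap
import re
from pathlib import Path


def split_elements(src, out_dir, tag="object", prefix="package"):
    """Write each <tag>...</tag> span in src to its own file, byte for byte."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    name = re.escape(tag.encode("ascii"))
    # Bytes pattern: nothing is decoded, so special characters and
    # carriage returns are copied exactly as they appear in the source.
    pattern = re.compile(rb"<" + name + rb"\b.*?</" + name + rb">", re.DOTALL)
    written = 0
    # Note: mmap raises ValueError for an empty file.
    with open(src, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for m in pattern.finditer(mm):
            target = out_dir / f"{prefix}-{written:04d}.xml"
            target.write_bytes(m.group(0))
            written += 1
    return written
```

A call might look like `split_elements("/temp/Test/package1.xml", "/temp/Test")`, which returns the number of files written.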

Upvotes: 0

Krzysztof Krasoń

Reputation: 27476

Use xmllint to do the parsing (assuming your XML is in a file named a.xml and the root node is named testcases):

for num in $(xmllint --xpath '/testcases/testcase/@id' a.xml | sed -r 's/[^"]+"([0-9]+)"/\1 /g'); do
    xmllint --xpath "/testcases/testcase[@id=$num]" a.xml > "$num.xml"
done

First we get the testcase ids (XPath returns them in the form id="001", so sed is used to extract just the numbers). Then a second XPath query retrieves the testcase with each id and saves it to a file named after that id.
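Because the question says the input is not well-formed, a parser such as xmllint may reject it. The id-based file naming can be approximated without a parser. This is a rough sketch in Python rather than shell, and it is not part of this answer's method: `split_by_id` is a hypothetical helper, and it assumes the id is double-quoted, sits in the opening tag, and that no attribute value contains `>`.

```python
import re
from pathlib import Path

# One testcase element, matched lazily across lines, on raw bytes.
ELEMENT = re.compile(rb"<testcase\b.*?</testcase>", re.DOTALL)
# Assumption: the id attribute is written as id="...".
ID_ATTR = re.compile(rb'\bid="([^"]+)"')


def split_by_id(src, out_dir):
    """Write each testcase element to <id>.xml; return the file names written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    data = Path(src).read_bytes()
    names = []
    for n, m in enumerate(ELEMENT.finditer(data), start=1):
        element = m.group(0)
        # Only look for the id inside the opening tag.
        opening_tag = element[: element.index(b">") + 1]
        found = ID_ATTR.search(opening_tag)
        # Fall back to matchN when the element has no id.
        stem = found.group(1).decode("ascii", "replace") if found else f"match{n}"
        # Keep the file name safe: replace anything unusual with "_".
        stem = re.sub(r"[^\w.-]", "_", stem)
        name = stem + ".xml"
        (out_dir / name).write_bytes(element)
        names.append(name)
    return names
```

Elements that share an id would overwrite each other here, so this only fits inputs where ids are unique.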

Upvotes: 3

vtd-xml-author

Reputation: 3377

This is actually a short piece of code to write and test. Here it is, combining XPath and vtd-xml.

import com.ximpleware.*;
import java.io.*;

public class simpleSplit {
    public static void main(String[] s) throws VTDException,IOException{
        VTDGen vg = new VTDGen();
        if (!vg.parseFile("d:\\xml\\inputTest.xml", false)) //namespace awareness disabled
            return;
        VTDNav vn = vg.getNav();
        AutoPilot ap = new AutoPilot(vn);
        AutoPilot ap2 = new AutoPilot(vn);
        ap.selectXPath("/root/testcase"); // main xpath expression
        ap2.selectXPath("@id");
        byte[] head = "<root>".getBytes();
        byte[] tail = "</root>".getBytes();
        int i=0;
        while((i=ap.evalXPath())!=-1){
            String fileName = ap2.evalXPathToString();
            FileOutputStream fios = new FileOutputStream("d:\\xml\\"+fileName+".xml");
            long l = vn.getElementFragment();
            fios.write(head);
            fios.write(vn.getXML().getBytes(), (int)l, (int)(l>>32));
            fios.write(tail);
            fios.close();
        }
    }
}

Upvotes: 0
