Reputation: 23
I am looking to extract elements from a large XML file into individual files, preferably with a command or script.
The issue is that the XML is proprietary and not well-formed, and whenever I try XML utilities such as twig or xmlstarlet the data gets munged and special characters get corrupted. Hence my need for a simple regex match and a direct copy of exactly what matches to a file, repeated for each match, with file names that increment: match1.xml, match2.xml, and so on.
Example XML source:
...
<testcase id="001" kind="bvt">
<inputs>
<arg1>4</arg1>
<arg2>7</arg2>
</inputs>
<expected>11.00</expected>
</testcase>
<testcase id="002" kind="drt">
<inputs>
<arg1>9</arg1>
<arg2>6</arg2>
</inputs>
<expected>15.00</expected>
</testcase>
<testcase id="003" kind="bvt">
<inputs>
<arg1>5</arg1>
<arg2>8</arg2>
</inputs>
<expected>13.00</expected>
</testcase>
...
Desired output: Content of match1.xml:
...
<testcase id="001" kind="bvt">
<inputs>
<arg1>4</arg1>
<arg2>7</arg2>
</inputs>
<expected>11.00</expected>
</testcase>
...
Content of match2.xml:
...
<testcase id="002" kind="drt">
<inputs>
<arg1>9</arg1>
<arg2>6</arg2>
</inputs>
<expected>15.00</expected>
</testcase>
...
and so on.
Here is a regex I put together that works. All I need is help putting together a loop in a bash script to copy each match (element) to its own file.
(<testcase*[\s\S]*?<\/testcase>)
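One detail worth checking in the pattern above: `<testcase*` means `<testcas` followed by zero or more `e` characters, not "testcase followed by anything". It still matches the sample because the lazy `[\s\S]*?` absorbs the rest of the tag, but it can also start a match at a longer tag name. A minimal Python sketch of the difference follows; the sample string, the `<testcases>` wrapper, and the tighter `\b` variant are assumptions for illustration, not part of the original data.

```python
import re

# Pattern from the question: "e*" quantifies only the final "e" of "testcase".
loose = re.compile(r'(<testcase*[\s\S]*?<\/testcase>)')
# Hypothetical tighter variant: require a word boundary after the tag name.
strict = re.compile(r'(<testcase\b[\s\S]*?</testcase>)')

# Assumed sample with a <testcases> wrapper around two elements.
sample = (
    '<testcases>\n'
    '<testcase id="001" kind="bvt">\n<expected>11.00</expected>\n</testcase>\n'
    '<testcase id="002" kind="drt">\n<expected>15.00</expected>\n</testcase>\n'
    '</testcases>\n'
)

loose_matches = loose.findall(sample)
strict_matches = strict.findall(sample)

# The loose pattern starts its first match at the wrapper tag;
# the strict pattern starts at the first real <testcase ...> tag.
print(loose_matches[0].splitlines()[0])
print(strict_matches[0].splitlines()[0])
```

If the file has no enclosing tag whose name begins with `testcas`, both patterns should behave the same.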
Upvotes: 0
Views: 521
Reputation: 23
Figured it out! Python has a great regex module, "re", that I used to solve this.
Below is the Python I used. In this case the match is everything from the opening <object> tag (including line breaks, carriage returns, line feeds, special characters, etc.) up to and including the closing </object> tag, as needed in this use case.
Each object element is written to its own file, package-0000 through package-nnnn, and the content is exactly what was in the original file (no munging issues)! :)
import re

# Lazy match from <object> through the first </object>; [\s\S] also matches newlines.
pattern = re.compile(r'(<object>[\s\S]*?</object>)')

# newline='' disables newline translation, so \r\n is kept as-is.
with open("/temp/Test/package1.xml", 'r', newline='') as f:
    matches = pattern.findall(f.read())

for i, match in enumerate(matches):
    with open("/temp/Test/package-{0:04d}.xml".format(i), 'w', newline='') as nf:
        nf.write(match)
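If byte-for-byte copies matter (for example, keeping `\r\n` line endings and whatever encoding the proprietary file uses), one option is to read and write in binary mode so Python does no decoding or newline translation. This is a minimal sketch under that assumption; the function name `split_elements`, its parameters, and the output naming scheme are hypothetical.

```python
import re
from pathlib import Path


def split_elements(source: Path, out_dir: Path, tag: str = "object") -> list:
    """Write each <tag>...</tag> block found in source to its own file, unchanged."""
    name = re.escape(tag.encode("ascii"))
    # Bytes pattern: lazy match from the opening tag to the first closing tag.
    pattern = re.compile(rb"<" + name + rb"\b[\s\S]*?</" + name + rb">")
    data = source.read_bytes()  # raw bytes: no decoding, no newline translation
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for i, block in enumerate(pattern.findall(data)):
        target = out_dir / f"{source.stem}-{i:04d}.xml"
        target.write_bytes(block)
        written.append(target)
    return written
```

Calling `split_elements(Path("/temp/Test/package1.xml"), Path("/temp/Test/out"))` would then produce files such as `package1-0000.xml` in the assumed output directory. Like the original, this does not handle nested elements with the same tag name.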
Upvotes: 0
Reputation: 27476
Using xmllint to do the parsing (assuming your XML is in a file named a.xml and the main node is named testcases):
for num in $(xmllint --xpath '/testcases/testcase/@id' a.xml | sed -r 's/[^"]+"([0-9]+)"/\1 /g'); do
    xmllint --xpath "/testcases/testcase[@id='$num']" a.xml > "$num.xml"
done
First we get the testcase ids (xpath returns them in the form id="001", so sed is used to retrieve just the numbers).
Then we use xpath to retrieve just the testcase with the matching id and save it to a file named after that id.
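The same idea, naming each output file after the `id` attribute, can also be sketched without an XML parser, which may help when the input is not well-formed enough for xmllint. This is only a sketch: the helper names `blocks_by_id` and `write_by_id` are hypothetical, and it assumes every `testcase` start tag carries a double-quoted, unique `id` attribute as in the question's sample.

```python
import re
from pathlib import Path

# Opening tag containing a double-quoted id attribute, then lazily up to the close.
TESTCASE = re.compile(r'<testcase\b[^>]*?\bid="([^"]*)"[^>]*>[\s\S]*?</testcase>')


def blocks_by_id(text: str) -> dict:
    """Map each testcase id to the exact text of its element."""
    return {m.group(1): m.group(0) for m in TESTCASE.finditer(text)}


def write_by_id(text: str, out_dir: Path) -> None:
    """Write each element to <out_dir>/<id>.xml."""
    for case_id, block in blocks_by_id(text).items():
        # newline="" writes line endings exactly as they appear in the block.
        with open(out_dir / f"{case_id}.xml", "w", newline="") as out:
            out.write(block)
```

The input text could be read with `open("a.xml", newline="").read()` so that line endings are left untouched on the way in as well.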
Upvotes: 3
Reputation: 3377
This is a short piece of code to write and test. Here it is, combining XPath and VTD-XML.
import com.ximpleware.*;
import java.io.*;

public class simpleSplit {
    public static void main(String[] s) throws VTDException, IOException {
        VTDGen vg = new VTDGen();
        if (!vg.parseFile("d:\\xml\\inputTest.xml", false)) // namespace awareness disabled
            return;
        VTDNav vn = vg.getNav();
        AutoPilot ap = new AutoPilot(vn);
        AutoPilot ap2 = new AutoPilot(vn);
        ap.selectXPath("/root/testcase"); // main xpath expression
        ap2.selectXPath("@id");
        byte[] head = "<root>".getBytes();
        byte[] tail = "</root>".getBytes();
        int i = 0;
        while ((i = ap.evalXPath()) != -1) {
            String fileName = ap2.evalXPathToString();
            FileOutputStream fios = new FileOutputStream("d:\\xml\\" + fileName + ".xml");
            long l = vn.getElementFragment();
            fios.write(head);
            fios.write(vn.getXML().getBytes(), (int) l, (int) (l >> 32));
            fios.write(tail);
            fios.close();
        }
    }
}
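The `(int)l` and `(int)(l>>32)` casts above rely on how, as I understand the VTD-XML documentation, `getElementFragment()` packs its result into one long: the lower 32 bits hold the byte offset and the upper 32 bits hold the length. A small Python sketch of that arithmetic follows; the helper names and sample bytes are hypothetical, and it only mimics the packing rather than calling VTD-XML.

```python
def pack_fragment(offset: int, length: int) -> int:
    """Pack offset (lower 32 bits) and length (upper 32 bits) into one integer."""
    return (length << 32) | offset


def unpack_fragment(packed: int) -> tuple:
    """Recover (offset, length), mirroring (int)l and (int)(l>>32) in the Java code."""
    offset = packed & 0xFFFFFFFF
    length = packed >> 32
    return offset, length


# Assumed sample document; locate the element's byte range by hand.
doc = b'<root><testcase id="001">x</testcase></root>'
start = doc.index(b"<testcase")
end = doc.index(b"</testcase>") + len(b"</testcase>")

packed = pack_fragment(start, end - start)
offset, length = unpack_fragment(packed)

# Slice the raw bytes and wrap them, as the Java loop does with head and tail.
wrapped = b"<root>" + doc[offset:offset + length] + b"</root>"
```

Because the fragment is sliced straight out of the original byte buffer, the element's content is copied without re-serialization.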
Upvotes: 0