Reputation: 51927
We export “records” to an XML file; one of our customers has complained that the file is too big for their other system to process. Therefore I need to split up the file, while repeating the “header section” in each of the new files.
So I am looking for something that will let me define some XPaths for the section(s) that should always be output, and another XPath for the “rows”, with parameters that say how many rows to put in each file and how to name the files.
Before I start writing some custom .NET code to do this: is there a standard command-line tool that works on Windows and does this?
(As I know how to program in C#, I am more inclined to write code than to mess about with complex XSL etc., but an "off the shelf" solution would be better than custom code.)
Upvotes: 11
Views: 31620
Reputation: 3106
As mentioned already, xml_split from the Perl package XML::Twig does a great job.
xml_split < bigFile.xml
#or if compressed e.g.
bzcat bigFile.xml.bz2 | xml_split
Without any arguments, xml_split creates a file per top-level child node.
There are parameters to specify the number of elements you want per file (-g) or the approximate size (-s <Kb|Mb|Gb>).
sudo apt-get install xml-twig-tools
Upvotes: 4
Reputation: 629
Using UltraEdit, based on https://www.ultraedit.com/forums/viewtopic.php?f=52&t=6704
All I added was some XML header and footer bits. The first and last files need to be fixed manually (or remove the root element from your source).
// from https://www.ultraedit.com/forums/viewtopic.php?f=52&t=6704
var FoundsPerFile = 200; // Global setting for number of found split strings per file.
var SplitString = "</letter>"; // String where to split. The split occurs after next character.
var xmlHead = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>';
var xmlRootStart = '<letters xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" letterCode="OA01" >';
var xmlRootEnd = '</letters>';
/* Find the tab index of the active document */
// Copied from http://www.ultraedit.com/forums/viewtopic.php?t=4571
function getActiveDocumentIndex () {
var tabindex = -1; /* start value */
for (var i = 0; i < UltraEdit.document.length; i++)
{
if (UltraEdit.activeDocument.path==UltraEdit.document[i].path) {
tabindex = i;
break;
}
}
return tabindex;
}
if (UltraEdit.document.length) { // Is any file open?
// Set working environment required for this job.
UltraEdit.insertMode();
UltraEdit.columnModeOff();
UltraEdit.activeDocument.hexOff();
UltraEdit.ueReOn();
// Move cursor to top of active file and run the initial search.
UltraEdit.activeDocument.top();
UltraEdit.activeDocument.findReplace.searchDown=true;
UltraEdit.activeDocument.findReplace.matchCase=true;
UltraEdit.activeDocument.findReplace.matchWord=false;
UltraEdit.activeDocument.findReplace.regExp=false;
// If the string to split is not found in this file, do nothing.
if (UltraEdit.activeDocument.findReplace.find(SplitString)) {
// This file is probably the correct file for this script.
var FileNumber = 1; // Counts the number of saved files.
var StringsFound = 1; // Counts the number of found split strings.
var NewFileIndex = UltraEdit.document.length;
/* Get the path of the current file to save the new
files in the same directory as the current file. */
var SavePath = "";
var LastBackSlash = UltraEdit.activeDocument.path.lastIndexOf("\\");
if (LastBackSlash >= 0) {
LastBackSlash++;
SavePath = UltraEdit.activeDocument.path.substring(0,LastBackSlash);
}
/* Get active file index in case of more than 1 file is open and the
current file does not get back the focus after closing the new files. */
var FileToSplit = getActiveDocumentIndex();
// Always use clipboard 9 for this script and not the Windows clipboard.
UltraEdit.selectClipboard(9);
// Split the file after every x found split strings until source file is empty.
while (1) {
while (StringsFound < FoundsPerFile) {
if (UltraEdit.document[FileToSplit].findReplace.find(SplitString)) StringsFound++;
else {
UltraEdit.document[FileToSplit].bottom();
break;
}
}
// End the selection of the find command.
UltraEdit.document[FileToSplit].endSelect();
// Move the cursor right to include the next character and unselect the found string.
UltraEdit.document[FileToSplit].key("RIGHT ARROW");
// Select from this cursor position everything to top of the file.
UltraEdit.document[FileToSplit].selectToTop();
// Is the file not already empty?
if (UltraEdit.document[FileToSplit].isSel()) {
// Cut the selection and paste it into a new file.
UltraEdit.document[FileToSplit].cut();
UltraEdit.newFile();
UltraEdit.document[NewFileIndex].setActive();
UltraEdit.activeDocument.paste();
/* Add line termination on the last line and remove automatically added indent
spaces/tabs if auto-indent is enabled if the last line is not already terminated. */
if (UltraEdit.activeDocument.isColNumGt(1)) {
UltraEdit.activeDocument.insertLine();
if (UltraEdit.activeDocument.isColNumGt(1)) {
UltraEdit.activeDocument.deleteToStartOfLine();
}
}
// add headers and footers
UltraEdit.activeDocument.top();
UltraEdit.activeDocument.write(xmlHead);
UltraEdit.activeDocument.write(xmlRootStart);
UltraEdit.activeDocument.bottom();
UltraEdit.activeDocument.write(xmlRootEnd);
// Build the file name for this new file.
var SaveFileName = SavePath + "LETTER";
if (FileNumber < 10) SaveFileName += "0";
SaveFileName += String(FileNumber) + ".raw.xml";
// Save the new file and close it.
UltraEdit.saveAs(SaveFileName);
UltraEdit.closeFile(SaveFileName,2);
FileNumber++;
StringsFound = 0;
/* Delete the line termination in the source file
if last found split string was at end of a line. */
UltraEdit.document[FileToSplit].endSelect();
UltraEdit.document[FileToSplit].key("END");
if (UltraEdit.document[FileToSplit].isColNumGt(1)) {
UltraEdit.document[FileToSplit].top();
} else {
UltraEdit.document[FileToSplit].deleteLine();
}
} else break;
UltraEdit.outputWindow.write("Progress " + SaveFileName);
} // Loop executed until source file is empty!
// Close source file without saving and re-open it.
var NameOfFileToSplit = UltraEdit.document[FileToSplit].path;
UltraEdit.closeFile(NameOfFileToSplit,2);
/* The following code line could be commented if the source
file is not needed anymore for further actions. */
UltraEdit.open(NameOfFileToSplit);
// Free memory and switch back to Windows clipboard.
UltraEdit.clearClipboard();
UltraEdit.selectClipboard(0);
}
}
Upvotes: 1
Reputation: 665
First, download the foxe XML editor from http://www.firstobject.com/foxe242.zip
Then watch the video at http://www.firstobject.com/xml-splitter-script-video.htm, which explains how the split code works.
That page contains a script (it starts with split()). Copy it, create a "New Program" from the "File" menu in the XML editor, paste the code, and save it. The code is:
split()
{
CMarkup xmlInput, xmlOutput;
xmlInput.Open( "**50MB.xml**", MDF_READFILE );
int nObjectCount = 0, nFileCount = 0;
while ( xmlInput.FindElem("//**ACT**") )
{
if ( nObjectCount == 0 )
{
++nFileCount;
xmlOutput.Open( "**piece**" + nFileCount + ".xml", MDF_WRITEFILE );
xmlOutput.AddElem( "**root**" );
xmlOutput.IntoElem();
}
xmlOutput.AddSubDoc( xmlInput.GetSubDoc() );
++nObjectCount;
if ( nObjectCount == **5** )
{
xmlOutput.Close();
nObjectCount = 0;
}
}
if ( nObjectCount )
xmlOutput.Close();
xmlInput.Close();
return nFileCount;
}
Change the fields marked in bold (or with ** **) to suit your needs. (This is also explained on the video page.)
In the XML editor window, right-click and choose RUN (or simply press F9). The output bar in the window shows the number of files that were generated.
Note: the input file name can be a full path such as "C:\\Users\\AUser\\Desktop\\a_xml_file.xml" (with doubled backslashes), and the output file name can be "C:\\Users\\AUser\\Desktop\\anoutputfolder\\piece" + nFileCount + ".xml"
Upvotes: 4
Reputation: 8369
xml_split - split huge XML documents into smaller chunks
http://www.perlmonks.org/index.pl?node_id=429707
http://metacpan.org/pod/XML::Twig
Upvotes: 3
Reputation: 48
"is there a standard command line tool that will work on windows that does it?"
Yes. http://xponentsoftware.com/xmlSplit.aspx
Upvotes: -2
Reputation: 96870
There's no general-purpose solution to this, because there are so many different possible ways that your source XML could be structured.
It's reasonably straightforward to build an XSLT transform that will output a slice of an XML document. For instance, given this XML:
<header>
<data rec="1"/>
<data rec="2"/>
<data rec="3"/>
<data rec="4"/>
<data rec="5"/>
<data rec="6"/>
</header>
you can output a copy of the file containing only data elements within a certain range with this XSLT:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes"/>
<xsl:param name="startPosition"/>
<xsl:param name="endPosition"/>
<xsl:template match="@* | node()">
<xsl:copy>
<xsl:apply-templates select="@* | node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="header">
<xsl:copy>
<xsl:apply-templates select="data"/>
</xsl:copy>
</xsl:template>
<xsl:template match="data">
<xsl:if test="position() &gt;= $startPosition and position() &lt;= $endPosition">
<xsl:copy>
<xsl:apply-templates select="@* | node()"/>
</xsl:copy>
</xsl:if>
</xsl:template>
</xsl:stylesheet>
(Note, by the way, that because this is based on the identity transform, it works even if header isn't the top-level element.)
You still need to count the data elements in the source XML, and run the transform repeatedly with the values of $startPosition and $endPosition that are appropriate for the situation.
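The counting step could be sketched in Python roughly as follows. This is only an illustration, not part of the transform above: the function name position_ranges and the rows_per_file parameter are made up here, and it assumes all data elements are direct children of a single header element. How the resulting values are passed to the stylesheet depends on the XSLT processor you use, so that part is left out.

```python
import xml.etree.ElementTree as ET


def position_ranges(xml_text, rows_per_file, container_tag="header", row_tag="data"):
    """Return 1-based, inclusive (startPosition, endPosition) pairs, one per output file."""
    root = ET.fromstring(xml_text)
    # iter() also yields the root itself, so this works whether or not
    # the container is the top-level element.
    container = next(root.iter(container_tag))
    total = len(container.findall(row_tag))  # direct children only
    return [
        (start, min(start + rows_per_file - 1, total))
        for start in range(1, total + 1, rows_per_file)
    ]


SAMPLE = """<header>
  <data rec="1"/>
  <data rec="2"/>
  <data rec="3"/>
  <data rec="4"/>
  <data rec="5"/>
  <data rec="6"/>
</header>"""

# Each pair is one run of the transform; with 4 rows per file the
# six sample rows give the ranges 1-4 and 5-6.
for start, end in position_ranges(SAMPLE, 4):
    print(f"startPosition={start} endPosition={end}")
```

Each printed pair corresponds to one run of the stylesheet, producing one output file.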
Upvotes: 3
Reputation: 499302
There is nothing built in that can handle this situation easily.
Your approach sounds reasonable, though I would probably start with a "skeleton" document containing the elements that need to be repeated and generate several documents with the "records".
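A minimal sketch of that skeleton idea, in Python rather than .NET, and under some loud assumptions: the rows are direct children of the root element, everything else under the root is the part to repeat, and the file is small enough to load into memory (a very large export would probably need a streaming parser instead). The function name split_with_skeleton, the sample element names, and the file naming scheme are all hypothetical.

```python
import copy
import xml.etree.ElementTree as ET


def split_with_skeleton(xml_text, row_tag, rows_per_file):
    """Return a list of XML strings, each holding the repeated parts plus one batch of rows."""
    root = ET.fromstring(xml_text)
    rows = [child for child in root if child.tag == row_tag]

    # The skeleton is a copy of the root with every row removed; whatever
    # remains (the "header section") is repeated in each output document.
    skeleton = copy.deepcopy(root)
    for child in list(skeleton):
        if child.tag == row_tag:
            skeleton.remove(child)

    documents = []
    for start in range(0, len(rows), rows_per_file):
        doc = copy.deepcopy(skeleton)
        # Rows are appended after the repeated elements.
        doc.extend(rows[start:start + rows_per_file])
        documents.append(ET.tostring(doc, encoding="unicode"))
    return documents


SAMPLE = (
    "<export><meta><customer>ACME</customer></meta>"
    '<record id="1"/><record id="2"/><record id="3"/></export>'
)

for number, document in enumerate(split_with_skeleton(SAMPLE, "record", 2), start=1):
    # Hypothetical naming scheme; writing to disk is left to the caller.
    print(f"part{number:03d}.xml", document)
```

Note that this places the rows after the repeated elements, so it would reorder the output if the original file had header material following the rows.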
Update:
After a bit of digging, I found this article describing a way to split files using XSLT.
Upvotes: 1