Ian Ringrose

Reputation: 51927

How do I split a large xml file?

We export “records” to an xml file; one of our customers has complained that the file is too big for their other system to process. Therefore I need to split up the file, while repeating the “header section” in each of the new files.

So I am looking for something that will let me define some xpaths for the section(s) that should always be outputted, and another xpath for the “rows” with a parameter that says how many rows to put in each file and how to name the files.

Before I start writing custom .NET code to do this, is there a standard command-line tool that works on Windows and does it?

(As I know how to program in C#, I am more inclined to write code than to mess about with complex XSL etc., but an "off the shelf" solution would be better than custom code.)

Upvotes: 11

Views: 31620

Answers (7)

loomi

Reputation: 3106

As already mentioned, xml_split from the Perl package XML::Twig does a great job.

Usage

xml_split < bigFile.xml

#or if compressed e.g.
bzcat bigFile.xml.bz2 | xml_split

Without any arguments xml_split creates a file per top-level child node.

There are parameters to specify the number of elements you want per file (-g) or approximate size (-s <Kb|Mb|Gb>).

Installation

Windows

Look here

Linux

sudo apt-get install xml-twig-tools

Upvotes: 4

Steve Black

Reputation: 629

Using UltraEdit, based on the script at https://www.ultraedit.com/forums/viewtopic.php?f=52&t=6704

All I added was some XML header and footer bits. The first and last files need to be fixed manually (or remove the root element from your source first).

    // from https://www.ultraedit.com/forums/viewtopic.php?f=52&t=6704 

var FoundsPerFile = 200;      // Global setting for number of found split strings per file.
var SplitString = "</letter>";  // String where to split. The split occurs after next character.
var xmlHead = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>';
var xmlRootStart = '<letters xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" letterCode="OA01" >';
var xmlRootEnd = '</letters>';

/* Find the tab index of the active document */
// Copied from http://www.ultraedit.com/forums/viewtopic.php?t=4571
function getActiveDocumentIndex () {
   var tabindex = -1; /* start value */

   for (var i = 0; i < UltraEdit.document.length; i++)
   {
      if (UltraEdit.activeDocument.path==UltraEdit.document[i].path) {
         tabindex = i;
         break;
      }
   }
   return tabindex;
}

if (UltraEdit.document.length) { // Is any file open?
   // Set working environment required for this job.
   UltraEdit.insertMode();
   UltraEdit.columnModeOff();
   UltraEdit.activeDocument.hexOff();
   UltraEdit.ueReOn();

   // Move cursor to top of active file and run the initial search.
   UltraEdit.activeDocument.top();
   UltraEdit.activeDocument.findReplace.searchDown=true;
   UltraEdit.activeDocument.findReplace.matchCase=true;
   UltraEdit.activeDocument.findReplace.matchWord=false;
   UltraEdit.activeDocument.findReplace.regExp=false;
   // If the string to split is not found in this file, do nothing.
   if (UltraEdit.activeDocument.findReplace.find(SplitString)) {
      // This file is probably the correct file for this script.
      var FileNumber = 1;    // Counts the number of saved files.
      var StringsFound = 1;  // Counts the number of found split strings.
      var NewFileIndex = UltraEdit.document.length;
      /* Get the path of the current file to save the new
         files in the same directory as the current file. */
      var SavePath = "";
      var LastBackSlash = UltraEdit.activeDocument.path.lastIndexOf("\\");
      if (LastBackSlash >= 0) {
         LastBackSlash++;
         SavePath = UltraEdit.activeDocument.path.substring(0,LastBackSlash);
      }
      /* Get active file index in case of more than 1 file is open and the
         current file does not get back the focus after closing the new files. */
      var FileToSplit = getActiveDocumentIndex();
      // Always use clipboard 9 for this script and not the Windows clipboard.
      UltraEdit.selectClipboard(9);
      // Split the file after every x found split strings until source file is empty.
      while (1) {
         while (StringsFound < FoundsPerFile) {
            if (UltraEdit.document[FileToSplit].findReplace.find(SplitString)) StringsFound++;
            else {
               UltraEdit.document[FileToSplit].bottom();
               break;
            }
         }
         // End the selection of the find command.
         UltraEdit.document[FileToSplit].endSelect();
         // Move the cursor right to include the next character and unselect the found string.
         UltraEdit.document[FileToSplit].key("RIGHT ARROW");
         // Select from this cursor position everything to top of the file.
         UltraEdit.document[FileToSplit].selectToTop();
         // Is the file not already empty?
         if (UltraEdit.document[FileToSplit].isSel()) {
            // Cut the selection and paste it into a new file.
            UltraEdit.document[FileToSplit].cut();
            UltraEdit.newFile();
            UltraEdit.document[NewFileIndex].setActive();
            UltraEdit.activeDocument.paste();


            /* Add line termination on the last line and remove automatically added indent
               spaces/tabs if auto-indent is enabled if the last line is not already terminated. */
            if (UltraEdit.activeDocument.isColNumGt(1)) {
               UltraEdit.activeDocument.insertLine();
               if (UltraEdit.activeDocument.isColNumGt(1)) {
                  UltraEdit.activeDocument.deleteToStartOfLine();
               }
            }

            // add headers and footers 

            UltraEdit.activeDocument.top();
            UltraEdit.activeDocument.write(xmlHead);
            UltraEdit.activeDocument.write(xmlRootStart);
            UltraEdit.activeDocument.bottom();
            UltraEdit.activeDocument.write(xmlRootEnd);
            // Build the file name for this new file.
            var SaveFileName = SavePath + "LETTER";
            if (FileNumber < 10) SaveFileName += "0";
            SaveFileName += String(FileNumber) + ".raw.xml";
            // Save the new file and close it.
            UltraEdit.saveAs(SaveFileName);
            UltraEdit.closeFile(SaveFileName,2);
            FileNumber++;
            StringsFound = 0;
            /* Delete the line termination in the source file
               if last found split string was at end of a line. */
            UltraEdit.document[FileToSplit].endSelect();
            UltraEdit.document[FileToSplit].key("END");
            if (UltraEdit.document[FileToSplit].isColNumGt(1)) {
               UltraEdit.document[FileToSplit].top();
            } else {
               UltraEdit.document[FileToSplit].deleteLine();
            }
         } else break;
         // Runs once per saved file (it is not part of the else branch above).
         UltraEdit.outputWindow.write("Progress " + SaveFileName);
      }  // Loop executed until source file is empty!

      // Close source file without saving and re-open it.
      var NameOfFileToSplit = UltraEdit.document[FileToSplit].path;
      UltraEdit.closeFile(NameOfFileToSplit,2);
      /* The following code line could be commented if the source
         file is not needed anymore for further actions. */
      UltraEdit.open(NameOfFileToSplit);

      // Free memory and switch back to Windows clipboard.
      UltraEdit.clearClipboard();
      UltraEdit.selectClipboard(0);
   }
}

Upvotes: 1

ewroman

Reputation: 665

First, download the foxe XML editor from http://www.firstobject.com/foxe242.zip

Then watch the video at http://www.firstobject.com/xml-splitter-script-video.htm, which explains how the split code works.

That page includes a script (it starts with split()). Copy it, create a "New Program" from the "File" menu in the XML editor, paste the code, and save it. The code is:

split()
{
  CMarkup xmlInput, xmlOutput;
  xmlInput.Open( "**50MB.xml**", MDF_READFILE );
  int nObjectCount = 0, nFileCount = 0;
  while ( xmlInput.FindElem("//**ACT**") )
  {
    if ( nObjectCount == 0 )
    {
      ++nFileCount;
      xmlOutput.Open( "**piece**" + nFileCount + ".xml", MDF_WRITEFILE );
      xmlOutput.AddElem( "**root**" );
      xmlOutput.IntoElem();
    }
    xmlOutput.AddSubDoc( xmlInput.GetSubDoc() );
    ++nObjectCount;
    if ( nObjectCount == **5** )
    {
      xmlOutput.Close();
      nObjectCount = 0;
    }
  }
  if ( nObjectCount )
    xmlOutput.Close();
  xmlInput.Close();
  return nFileCount;
}

Change the fields marked in bold (or wrapped in ** **) to suit your needs; this is also explained on the video page.

In the XML editor window, right-click and choose Run (or simply press F9). The output bar in the window shows the number of files that were generated.

Note: the input file name can be a full path such as "C:\\Users\\AUser\\Desktop\\a_xml_file.xml" (with doubled backslashes), and the output file can be "C:\\Users\\AUser\\Desktop\\anoutputfolder\\piece" + nFileCount + ".xml"

Upvotes: 4

Gfy

Reputation: 8369

xml_split - split huge XML documents into smaller chunks

http://www.perlmonks.org/index.pl?node_id=429707

http://metacpan.org/pod/XML::Twig

Upvotes: 3

bill seacham

Reputation: 48

"is there a standard command line tool that will work on windows that does it?"

Yes. http://xponentsoftware.com/xmlSplit.aspx

Upvotes: -2

Robert Rossney

Reputation: 96870

There's no general-purpose solution to this, because there are so many different ways that your source XML could be structured.

It's reasonably straightforward to build an XSLT transform that will output a slice of an XML document. For instance, given this XML:

<header>
  <data rec="1"/>
  <data rec="2"/>
  <data rec="3"/>
  <data rec="4"/>
  <data rec="5"/>
  <data rec="6"/>
</header>

you can output a copy of the file containing only data elements within a certain range with this XSLT:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:param name="startPosition"/>
  <xsl:param name="endPosition"/>

  <xsl:template match="@* | node()">
      <xsl:copy>
          <xsl:apply-templates select="@* | node()"/>
      </xsl:copy> 
  </xsl:template>

  <xsl:template match="header">
    <xsl:copy>
      <xsl:apply-templates select="data"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="data">
    <xsl:if test="position() &gt;= $startPosition and position() &lt;= $endPosition">
      <xsl:copy>
        <xsl:apply-templates select="@* | node()"/>
      </xsl:copy>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>

(Note, by the way, that because this is based on the identity transform, it works even if header isn't the top-level element.)

You still need to count the data elements in the source XML, and run the transform repeatedly with the values of $startPosition and $endPosition that are appropriate for the situation.
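
The counting step can be scripted. The sketch below is not part of the original answer: it assumes Python's standard xml.etree.ElementTree module, and the helper name slice_ranges and its parameters are hypothetical. It counts the data children of the first header element and returns the 1-based, inclusive pairs that could be passed as $startPosition and $endPosition.

```python
import xml.etree.ElementTree as ET

# Sample document from the answer above.
SAMPLE = """<header>
  <data rec="1"/>
  <data rec="2"/>
  <data rec="3"/>
  <data rec="4"/>
  <data rec="5"/>
  <data rec="6"/>
</header>"""


def slice_ranges(xml_text, rows_per_file, parent_tag="header", row_tag="data"):
    """Hypothetical helper: return 1-based, inclusive (start, end) pairs."""
    root = ET.fromstring(xml_text)
    # iter() also yields the root itself, so this works whether or not
    # the parent element is the top-level element.
    parent = next(root.iter(parent_tag))
    total = len(parent.findall(row_tag))  # direct children only
    return [
        (start, min(start + rows_per_file - 1, total))
        for start in range(1, total + 1, rows_per_file)
    ]


for start, end in slice_ranges(SAMPLE, 4):
    print(start, end)  # prints "1 4", then "5 6"
```

Each pair then feeds one run of the transform, using whichever XSLT processor you have available. This assumes a single header element; with several, position() restarts inside each one and the ranges would need to be computed per parent.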

Upvotes: 3

Oded

Reputation: 499302

There is nothing built in that can handle this situation easily.

Your approach sounds reasonable, though I would probably start with a "skeleton" document containing the elements that need to be repeated and generate several documents with the "records".
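
A minimal sketch of that skeleton idea, written in Python rather than C# for brevity and using only the standard xml.etree.ElementTree module. The element names (export, header, record) and the function split_with_skeleton are hypothetical, and the sketch assumes the rows are direct children of the root element.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical export: one repeated header section plus five rows.
SAMPLE = """<export>
  <header><customer>ACME</customer></header>
  <record id="1"/>
  <record id="2"/>
  <record id="3"/>
  <record id="4"/>
  <record id="5"/>
</export>"""


def split_with_skeleton(xml_text, row_tag, rows_per_file):
    """Return a list of XML strings, each holding the skeleton plus a slice of rows."""
    root = ET.fromstring(xml_text)
    rows = [child for child in root if child.tag == row_tag]

    # The skeleton is the document with every row removed.
    skeleton = copy.deepcopy(root)
    for child in list(skeleton):
        if child.tag == row_tag:
            skeleton.remove(child)

    pieces = []
    for start in range(0, len(rows), rows_per_file):
        piece = copy.deepcopy(skeleton)
        piece.extend([copy.deepcopy(row) for row in rows[start:start + rows_per_file]])
        pieces.append(ET.tostring(piece, encoding="unicode"))
    return pieces


pieces = split_with_skeleton(SAMPLE, "record", 2)
print(len(pieces))  # prints 3
```

Each string can then be written to its own numbered file. This version loads the whole document into memory and drops the XML declaration and any comments, so a genuinely large export would need a streaming parser instead.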


Update:

After a bit of digging, I found this article describing a way to split files using XSLT.

Upvotes: 1
