sameer karjatkar

Reputation: 2057

XML Split of a Large file

I have a 15 GB XML file that I want to split. It has approximately 300 million lines. It doesn't have any interdependent top-level nodes. Is there a tool available that readily does this for me?

Upvotes: 24

Views: 95376

Answers (10)

user11106941

Reputation: 11

Perhaps this question is still relevant, and I believe this can help somebody. The XML editor XiMpLe contains a tool for splitting big files. Only the fragment size is required. There is also a reverse function to join XML files back together. It's free for non-commercial use, and the license is not expensive either. No installation is required. It worked very well for me (I had a 5 GB file).

Upvotes: 1

Farid

Reputation: 189

I used the XmlSplit Wizard tool. It works nicely, and you can specify the split method: by element, rows, number of files, or file size. The only problem is that I had to buy it for $99, because the trial version won't split all of the data, only the odd-numbered output files. I was able to split a 70 GB file!

Upvotes: 0

Shivendra

Reputation: 1558

I used this to split the Yahoo Q&A dataset:

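    # Placeholders to replace: 'filepath', the tag to split on, and
    # endTag (the name of the root element).
    count = 0
    file_count = 1
    current_file = ""

    with open('filepath') as f:
        for line in f:
            current_file = current_file + line

            # Count one record each time its closing tag appears.
            if "</your tag to split>" in line:
                count = count + 1

            if count == 50000:
                # Close the root element and write out this chunk.
                current_file = current_file + "</endTag>"
                with open('filepath/Split/file_' + str(file_count) + '.xml', 'w') as split:
                    split.write(current_file)
                file_count = file_count + 1
                # Start the next chunk with its own declaration and root element.
                current_file = "<?xml version='1.0' encoding='UTF-8'?>\n<endTag>"
                count = 0

    # Write whatever is left. The tail of the input normally already ends
    # with the root's closing tag, so only add one if it is missing.
    if not current_file.rstrip().endswith("</endTag>"):
        current_file = current_file + "</endTag>"
    with open('filepath/Split/file_' + str(file_count) + '.xml', 'w') as split:
        split.write(current_file)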

Upvotes: 0

mat_geek

Reputation: 2511

The open source library comma has several tools to find data in very large XML files and to split those files into smaller files.

https://github.com/acfr/comma/wiki/XML-Utilities

The tools are built on the expat SAX parser, so unlike xmlstarlet and saxon they do not fill memory with a DOM tree.
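
To illustrate the streaming idea (this is not the comma tools' own code), here is a minimal sketch using Python's standard-library expat binding. The function name `count_children_of_root` is made up for the example; it only counts the direct children of the root, but it shows how callbacks let you walk a file of any size without ever holding a tree in memory.

```python
import xml.parsers.expat


def count_children_of_root(path):
    """Count the direct children of the root element without building a tree."""
    state = {"depth": 0, "count": 0}

    def start(name, attrs):
        state["depth"] += 1
        if state["depth"] == 2:  # depth 1 is the root itself
            state["count"] += 1

    def end(name):
        state["depth"] -= 1

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    with open(path, "rb") as fh:
        parser.ParseFile(fh)  # expat pulls the data from the file in chunks
    return state["count"]
```

A splitter works the same way: instead of counting, the handlers would write each record out as it streams past.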

Upvotes: 1

Gfy

Reputation: 8319

XmlSplit - A Command-line Tool That Splits Large XML Files

xml_split - split huge XML documents into smaller chunks

Split that XML by bhayanakmaut (No source code and I could not get this one working)

A similar question: How do I split a large xml file?

Upvotes: 12

eleg

Reputation: 141

QXMLEdit has a dedicated function for that. I used it successfully with a Wikipedia dump: the ~2.7 GiB file became roughly 1,400,000 files (one per page). It can even distribute them into subfolders.

Upvotes: 13

John Saunders

Reputation: 161773

In what way do you need to split it? It's pretty easy to write code using XmlReader.ReadSubtree. It returns a new XmlReader instance positioned on the current element and all its child elements. So, move to the first child of the root, call ReadSubtree, write all those nodes, call Read() on the original reader, and loop until done.
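
The same loop can be sketched in Python, the language of another answer here. This is not the .NET API: it swaps XmlReader for the standard library's `xml.etree.ElementTree.iterparse`, and the names `split_xml`, `per_file` and `prefix` are made up for the example. It assumes the root has no namespace and that the root's attributes can be dropped from the output files.

```python
import os
import xml.etree.ElementTree as ET


def split_xml(src, out_dir, per_file, prefix="piece"):
    """Copy each child of the root into numbered files, per_file children each.

    Returns the number of files written.
    """
    os.makedirs(out_dir, exist_ok=True)
    out = None
    root = None
    depth = 0
    in_file = 0
    n_files = 0
    for event, elem in ET.iterparse(src, events=("start", "end")):
        if event == "start":
            depth += 1
            if root is None:
                root = elem  # the first start event is the root element
            continue
        depth -= 1
        if depth != 1:
            continue  # not the end of a direct child of the root
        if out is None:
            n_files += 1
            path = os.path.join(out_dir, f"{prefix}{n_files}.xml")
            out = open(path, "w", encoding="utf-8")
            out.write(f"<{root.tag}>\n")
        elem.tail = None  # do not copy whitespace that follows the element
        out.write(ET.tostring(elem, encoding="unicode") + "\n")
        in_file += 1
        root.clear()  # drop the finished subtree so memory use stays flat
        if in_file == per_file:
            out.write(f"</{root.tag}>\n")
            out.close()
            out, in_file = None, 0
    if out is not None:
        out.write(f"</{root.tag}>\n")
        out.close()
    return n_files
```

Calling something like `split_xml("15GB.xml", "pieces", 1000000)` would write `pieces/piece1.xml`, `pieces/piece2.xml`, and so on, each wrapped in a copy of the root tag.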

Upvotes: 0

Ben Bryant

Reputation: 3143

Here is a script with a low memory footprint that does it in the free firstobject XML editor (foxe) using CMarkup file mode. I am not sure what you mean by no interdependent top nodes, or tag checking. But assuming that under the root element you have millions of top-level elements containing object properties or rows that each need to be kept together as a unit, and you want, say, 1 million per output file, you could do this:

    split_xml_15GB()
    {
      int nObjectCount = 0, nFileCount = 0;
      CMarkup xmlInput, xmlOutput;
      xmlInput.Open( "15GB.xml", MDF_READFILE );
      xmlInput.FindElem(); // root
      str sRootTag = xmlInput.GetTagName();
      xmlInput.IntoElem();
      while ( xmlInput.FindElem() )
      {
        if ( nObjectCount == 0 )
        {
          ++nFileCount;
          xmlOutput.Open( "piece" + nFileCount + ".xml", MDF_WRITEFILE );
          xmlOutput.AddElem( sRootTag );
          xmlOutput.IntoElem();
        }
        xmlOutput.AddSubDoc( xmlInput.GetSubDoc() );
        ++nObjectCount;
        if ( nObjectCount == 1000000 )
        {
          xmlOutput.Close();
          nObjectCount = 0;
        }
      }
      if ( nObjectCount )
        xmlOutput.Close();
      xmlInput.Close();
      return nFileCount;
    }

I posted a YouTube video and an article about this here:

http://www.firstobject.com/xml-splitter-script-video.htm

Upvotes: 5

Cerebrus

Reputation: 25775

I think you'll have to split manually unless you are interested in doing it programmatically. Here's a sample that does that, though it doesn't mention the max size of handled XML files. When doing it manually, the first problem that arises is how to open the file itself.

I would recommend a very simple text editor - something like Vim. When handling such large files, it is always useful to turn off all forms of syntax highlighting and/or folding.

Other options worth considering:

  1. EditPadPro - I've never tried it with anything this size, but if it's anything like other JGSoft products, it should work like a breeze. Remember to turn off syntax highlighting.

  2. VEdit - I've used this with files of 1 GB in size; it handles them as if they were nothing at all.

  3. EmEditor

Upvotes: 5

MrTelly

Reputation: 14865

Not an XML tool, but UltraEdit could probably help. I've used it with 2 GB files and it didn't mind at all. Make sure you turn off the auto-backup feature, though.

Upvotes: -1
