Reputation: 2057
I have a 15 GB XML file that I want to split. It has approximately 300 million lines. It doesn't have any top-level nodes that are interdependent. Is there a tool that readily does this for me?
Upvotes: 24
Views: 95376
Reputation: 11
Perhaps this question is still relevant, and I believe this can help somebody. The XML editor XiMpLe contains a tool for splitting big files. Only the fragment size is required. There is also the reverse functionality to join XML files back together. It's free for non-commercial use, and the license is not expensive. No installation is required. It worked very well for me (I had a 5 GB file).
Upvotes: 1
Reputation: 189
I used the XmlSplit Wizard tool. It works nicely, and you can specify the split method: by element, rows, number of files, or size of files. The only problem is that I had to buy it for $99, as the trial version won't let you split all the data; it only produces the odd-numbered output files. I was able to split a 70 GB file!
Upvotes: 0
Reputation: 1558
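I used this to split the Yahoo Q&A dataset:
count = 0
file_count = 1
with open('filepath') as f:
    # The first chunk starts with whatever header lines the input file has.
    current_file = ""
    for line in f:
        current_file = current_file + line
        # Count one record each time its closing tag appears on a line.
        if "</your tag to split>" in line:
            count = count + 1
        if count == 50000:
            # Close the wrapper element and flush this chunk to disk.
            current_file = current_file + "</endTag>"
            with open('filepath/Split/file_' + str(file_count) + '.xml', 'w') as split:
                split.write(current_file)
            file_count = file_count + 1
            # Start the next chunk with a fresh declaration and wrapper element.
            current_file = "<?xml version='1.0' encoding='UTF-8'?>\n<endTag>"
            count = 0
# Write the remaining records; any lines after the final record in the
# input (such as its own closing tag) also land in this last chunk.
current_file = current_file + "</endTag>"
with open('filepath/Split/file_' + str(file_count) + '.xml', 'w') as split:
    split.write(current_file)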
Upvotes: 0
Reputation: 2511
The open-source library comma has several tools to find data in very large XML files and to split those files into smaller files.
https://github.com/acfr/comma/wiki/XML-Utilities
The tools were built on the expat SAX parser, so they do not fill memory with a DOM tree the way xmlstarlet and Saxon do.
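To illustrate why an expat-style parser keeps memory flat, here is a minimal sketch using Python's built-in `xml.parsers.expat` binding. It is not code from comma (whose tools are separate programs); the function name `count_top_level` is made up for this example. It only counts the direct children of the root by tag name, and never builds a tree.

```python
import xml.parsers.expat
from collections import Counter


def count_top_level(stream):
    """Count direct children of the root by tag name, without building a tree.

    `stream` is any binary file-like object with a read(nbytes) method.
    `count_top_level` is a hypothetical helper name for this sketch.
    """
    counts = Counter()
    depth = 0

    def start(name, attrs):
        nonlocal depth
        depth += 1
        if depth == 2:              # depth 1 is the root element itself
            counts[name] += 1

    def end(name):
        nonlocal depth
        depth -= 1

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.ParseFile(stream)        # expat pulls the data in chunks
    return counts
```

For a real file you would pass `open(path, "rb")`. Only the counter and the current depth are held in memory, regardless of file size.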
Upvotes: 1
Reputation: 8319
XmlSplit - A Command-line Tool That Splits Large XML Files
xml_split - split huge XML documents into smaller chunks
Split that XML by bhayanakmaut (No source code and I could not get this one working)
A similar question: How do I split a large xml file?
Upvotes: 12
Reputation: 141
QXMLEdit has a dedicated function for that: I used it successfully with a Wikipedia dump. The ~2.7 GiB file became about 1,400,000 files (one per page). It even lets you distribute them into subfolders.
Upvotes: 13
Reputation: 161773
In what way do you need to split it? It's pretty easy to write code using XmlReader.ReadSubtree. It returns a new XmlReader instance positioned on the current element and covering all its child elements. So, move to the first child of the root, call ReadSubtree, write all those nodes, call Read() on the original reader, and loop until done.
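For readers not on .NET, the same loop can be sketched in Python with `xml.etree.ElementTree.iterparse`, which plays a role similar to the streaming reader here. This is an analogue, not a translation of XmlReader. The function name `split_xml`, the `piece%d.xml` file naming, and the `chunk_size` parameter are assumptions made for illustration. The sketch also assumes the root has no namespace, and it does not copy the root's attributes to the output files.

```python
import os
import xml.etree.ElementTree as ET


def split_xml(source, out_dir, chunk_size):
    """Write chunk_size top-level children per file; return the file count.

    Hypothetical helper: assumes a namespace-free root and drops its attributes.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)            # first event: start of the root element
    open_tag = "<{}>\n".format(root.tag).encode("utf-8")
    close_tag = "</{}>\n".format(root.tag).encode("utf-8")
    depth = 0                          # nesting depth below the root
    count = 0                          # children written to the current file
    files = 0
    out = None
    for event, elem in context:
        if event == "start":
            depth += 1
            continue
        depth -= 1
        if depth != 0:
            continue                   # a nested element (or the root) ended
        if out is None:                # lazily open the next output file
            files += 1
            out = open(os.path.join(out_dir, "piece%d.xml" % files), "wb")
            out.write(open_tag)
        out.write(ET.tostring(elem))   # serialize one complete child subtree
        root.clear()                   # release children already written
        count += 1
        if count == chunk_size:
            out.write(close_tag)
            out.close()
            out, count = None, 0
    if out is not None:                # final, partially filled file
        out.write(close_tag)
        out.close()
    return files
```

A call such as `split_xml("15GB.xml", "pieces", 1000000)` would write `pieces/piece1.xml`, `pieces/piece2.xml`, and so on, assuming the `pieces` directory already exists. Clearing the root after each child is what keeps memory use roughly constant.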
Upvotes: 0
Reputation: 3143
Here is a low-memory-footprint script to do it in the free firstobject XML editor (foxe) using CMarkup file mode. I am not sure what you mean by no interdependent top nodes, or tag checking. But assuming that under the root element you have millions of top-level elements containing object properties or rows that each need to be kept together as a unit, and you want, say, 1 million per output file, you could do this:
split_xml_15GB()
{
  int nObjectCount = 0, nFileCount = 0;
  CMarkup xmlInput, xmlOutput;
  xmlInput.Open( "15GB.xml", MDF_READFILE );
  xmlInput.FindElem(); // root
  str sRootTag = xmlInput.GetTagName();
  xmlInput.IntoElem();
  while ( xmlInput.FindElem() )
  {
    if ( nObjectCount == 0 )
    {
      ++nFileCount;
      xmlOutput.Open( "piece" + nFileCount + ".xml", MDF_WRITEFILE );
      xmlOutput.AddElem( sRootTag );
      xmlOutput.IntoElem();
    }
    xmlOutput.AddSubDoc( xmlInput.GetSubDoc() );
    ++nObjectCount;
    if ( nObjectCount == 1000000 )
    {
      xmlOutput.Close();
      nObjectCount = 0;
    }
  }
  if ( nObjectCount )
    xmlOutput.Close();
  xmlInput.Close();
  return nFileCount;
}
I posted a YouTube video and article about this here:
http://www.firstobject.com/xml-splitter-script-video.htm
Upvotes: 5
Reputation: 25775
I think you'll have to split manually unless you are interested in doing it programmatically. Here's a sample that does that, though it doesn't mention the max size of handled XML files. When doing it manually, the first problem that arises is how to open the file itself.
I would recommend a very simple text editor - something like Vim. When handling such large files, it is always useful to turn off all forms of syntax highlighting and/or folding.
Other options worth considering:
EditPadPro - I've never tried it with anything this size, but if it's anything like other JGSoft products, it should work like a breeze. Remember to turn off syntax highlighting.
VEdit - I've used this with files of 1GB in size, works as if it were nothing at all.
Upvotes: 5