Reputation: 107
I have XML file with >100MB (more than 20L lines) and I cannot pass it directly to my one of the process. I need to split in to multiple smaller files by assuming some separator and I tried using FileSystemObject in VBScript as well as BATCH file option. Both takes more than 8 mins to read and create first small file with 10000 lines.
Please convey some good performance oriented options for this task.
Appreciate your help.
Method1:
Function SplitXML()
Dim oSrcFile, oTgtFile, strHeader, intFiles, strContent, intSize
Dim arrLines()
Set oFSO = CreateObject("Scripting.FileSystemObject")
strFilePath = Application.GetOpenFilename
intLinesToSplit = InputBox("Enter the No of Lines to split with for each file:", "XML Splitter", 10000)
strTgtPath = Replace(strFilePath, oFSO.GetFileName(strFilePath), "")
strFileName = Replace(oFSO.GetFileName(strFilePath), ".xml", "")
Set oSrcFile = oFSO.OpenTextFile(strFilePath)
'Headers
strHeader = "": strContent = ""
Do
strTemp = oSrcFile.ReadLine
strHeader = strHeader & vbNewLine & strTemp
Loop While InStr(1, strTemp, "</BLHeader>", 1) <= 0
'Split
intTemp = 0: intFiles = 0: blnNewFile = True: intSize = -1
Do While Not oSrcFile.AtEndOfStream
intTemp = intTemp + 1
'Content
'intSize = intSize + 1
'ReDim Preserve arrLines(intSize)
'arrLines(intSize) = oSrcFile.ReadLine
strTemp = oSrcFile.ReadLine
strContent = strContent & vbNewLine & strTemp
If intTemp >= intLinesToSplit Then
If InStr(1, strTemp, "</EndingTag>", 1) > 0 Then
'Add Header
intFiles = intFiles + 1
Set oTgtFile = oFSO.CreateTextFile(strTgtPath & strFileName & "_" & intFiles & ".xml", True)
oTgtFile.WriteLine strHeader
'Add Content
oTgtFile.WriteLine strContent 'Join(arrLines, vbNewLine)
'Add tail
oTgtFile.WriteLine "</FinalFileTag>"
oTgtFile.Close
End If
End If
Loop
oSrcFile.Close
End Function
Method2:
@echo off
setlocal EnableDelayedExpansion
set InFile=c:\ee\EE28352646\in.txt
set OutDir=c:\ee\EE28352646
REM Can not be larger than 2147483648 !!!
set MaxLines=1000000
if not exist "%InFile%" (
echo *ERROR* Input file does not exist!
exit /b
)
if not exist "%OutDir%\" (
echo *ERROR* Output folder does not exist!
exit /b
)
for %%A in ("%InFile%") do (
set Name=%%~nA
set Ext=%%~xA
)
set /a Line=MaxLines+1
set File=0
for /f "usebackq tokens=*" %%A in ("%InFile%") do (
set /a Line+=1
if !Line! GTR %MaxLines% (
set /a File+=1
set OutFile=%OutDir%\%Name%_!File!%Ext%
if exist "!OutFile!" del "!OutFile!"
set Line=1
)
echo.%%A>>"!OutFile!"
)
Upvotes: 0
Views: 1700
Reputation: 200293
The best way to split any XML file is -as always- to use an actual XML parser for the task. Assuming you have an XML file like this:
<?xml version="1.0"?>
<Foo>
<Bar>some</Bar>
<Bar>other</Bar>
</Foo>
and want to split it into separate files
<?xml version="1.0"?>
<Foo>
<Bar>some</Bar>
</Foo>
and
<?xml version="1.0"?>
<Foo>
<Bar>other</Bar>
</Foo>
you could do something like this:
numFiles = 4 'number of output files
Set src = CreateObject("Msxml2.DOMDocument.6.0")
src.async = False
src.load "C:\path\to\input.xml"
Set nodes = src.selectNodes("//Bar")
numNodes = nodes.length \ numFiles 'number of nodes per output file
Set xml = Nothing
For i = 0 To nodes.length - 1
'create a new XML object on the first iteration and every time numNodes
'nodes have been added to the current object
If i Mod numNodes = 0 Then
If Not xml Is Nothing Then
'if we already have an XML object: save it to a file
Set prolog = xml.createProcessingInstruction("xml", "version='1.0'")
xml.insertBefore prolog, xml.childNodes(0)
xml.save "C:\path\to\output" & (i \ numNodes - 1) & ".xml"
End If
Set xml = CreateObject("Msxml2.DOMDocument.6.0")
Set root = xml.createElement("Foo")
xml.appendChild root
End If
root.appendChild nodes.item(i)
Next
'save unsaved XML object
If Not xml Is Nothing Then
Set prolog = xml.createProcessingInstruction("xml", "version='1.0'")
xml.insertBefore prolog, xml.childNodes(0)
xml.save "C:\path\to\output" & (i \ numNodes - 1) & ".xml"
End If
Upvotes: 2