Reputation: 2928
I'm working on a piece of code to split files. I want to split flat files (that part is working fine) and xml files. The idea is to split based on a number of output files: I have a file, and I want to split it into x files (x is a parameter). I do the split by taking the size of the file and dividing it by the number of files to produce. My solution was to use a BufferedReader like this:
while ((n = reader.read(buffer, 0, buffer.length)) != -1) {
    // write buffer[0..n) to the current output file
}
The main problem is that for the xml file I cannot just split it, but I have to split it based on a block delimited by a start xml tag and end xml tag:
<start tag>
bla bla xml stuff
</end tag>
So I cannot cut a block in the middle. If I am halfway through a block when the size of my new file exceeds my maximum, I have to keep reading until the end tag, and only then start the next file.
The problem is that there are all sorts of cases, and it is a bit difficult to search for the end tag:
- a read stops in the middle of the end tag
- a read stops exactly at the end of the end tag, with no other character after it
- etc.
At the same time I have to keep looping and read the next chunk. Sometimes the end xml tag only shows up once the end of one chunk is concatenated with the start of the next one. I hope you get the idea.
My question is: does anyone have an algorithm that does this more accurately and handles all the special cases?
The idea is to split the file as quickly as possible. I did not want to use a lib that treats the file as xml because a block can be small or very large, and I don't know if there will be enough memory. Or is there a lib that does not load everything into memory?
Thanks a lot.
Below is an example of my xml file:
<?xml version="1.0" encoding="UTF-8" ?>
<myTag service="toto" version="1.5.18" >
<endOfPeriodTradeNotification version="1.5.18">
.............
</endOfPeriodTradeNotification>
<endOfPeriodTradeNotification version="1.5.18">
.............
</endOfPeriodTradeNotification>
<endOfPeriodTradeNotification version="1.5.18">
.............
</endOfPeriodTradeNotification>
<inventoryDate>2009-12-31</inventoryDate>
<!-- reporting date -->
<processingDate>2010-01-29T00:00:00</processingDate>
</myTag>
I forgot one thing: my xml file could be written entirely on the first line, so I cannot assume that each line holds one tag.
Upvotes: 1
Views: 3046
Reputation: 3377
The best tool to split xml files is, hands down, vtd-xml. Not only is it super fast, it also makes the app easy to code, e.g. using xpath.
Upvotes: 0
Reputation: 6956
Although you have stated that you don't want to use a lib that treats the file as XML, you might want to consider using SAX.
Using SAX, rather than DOM, your fears about memory are allayed, as the whole file is not loaded into memory, but events occur as your application reads the file and encounters XML landmarks such as start and end tags.
SAX is also pretty fast.
This quickstart guide should help: http://www.saxproject.org/quickstart.html
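To make the event model concrete, here is a minimal sketch using the SAX parser that ships with the JDK. It only counts the direct children of the root that carry a given tag name, which is the information a splitter needs to decide where a block ends. The class name RecordCounter and the method count are made up for this illustration, and the sketch assumes the blocks are direct children of the root, as in the sample file.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: counts blocks without ever holding the document in memory.
class RecordCounter extends DefaultHandler {
    private final String recordTag;
    private int depth = 0;    // current nesting level, root element = 1
    private int records = 0;  // completed blocks seen so far

    RecordCounter(String recordTag) {
        this.recordTag = recordTag;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        depth++;
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // depth == 2 means a direct child of the root has just been closed;
        // this is the point where a splitter could safely start a new file.
        if (depth == 2 && qName.equals(recordTag)) {
            records++;
        }
        depth--;
    }

    static int count(InputSource source, String recordTag) throws Exception {
        RecordCounter handler = new RecordCounter(recordTag);
        // The default factory is not namespace-aware, so qName is the plain tag name.
        SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        return handler.records;
    }
}
```

For a real file you would pass something like new InputSource(new FileInputStream(path)). Line breaks do not matter here, so a document written on a single line is handled the same way.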
Upvotes: 1
Reputation: 420951
Provided the end tags that you're after are on lines by themselves, you could simply do
String line;
while ((line = reader.readLine()) != null)
instead of:
while ((n = reader.read(buffer, 0, buffer.length)) != -1)
and then split into a new file whenever line matches an end tag and the current file is large enough.
If they are not on lines by themselves, you could use line.indexOf(...) to find the tag instead, split the line, put the first part in the current file, and save the second part for the next file.
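The line-based approach, including the case where the end tag sits in the middle of a line, could be sketched roughly as follows. String.indexOf locates the tag, the text up to and including the tag goes to the current part, and the remainder is carried over. The names EndTagSplitter, split and minChars are invented for this example, and the parts are collected as strings only to keep the sketch short; a real splitter would write each part to its own file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: cuts only right after an end tag, once a part is large enough.
class EndTagSplitter {
    static List<String> split(BufferedReader reader, String endTag, int minChars) throws IOException {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            // readLine() strips the line terminator, so put a newline back.
            // Assumption: normalizing line endings to "\n" is acceptable.
            String text = line + "\n";
            int pos = 0;
            int idx;
            while ((idx = text.indexOf(endTag, pos)) != -1) {
                int cut = idx + endTag.length();
                current.append(text, pos, cut); // first part stays in the current file
                pos = cut;
                if (current.length() >= minChars) {
                    parts.add(current.toString());
                    current.setLength(0);       // the rest of the line starts the next file
                }
            }
            current.append(text, pos, text.length());
        }
        if (current.length() > 0) {
            parts.add(current.toString());
        }
        return parts;
    }
}
```

Note that readLine() buffers a whole line, so if the entire document is written on one line this reads the full file into memory before any splitting happens.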
However, as pointed out in the comments, the split xml files will be far from valid xml unless you take care of a few things. For instance, the first part may look like:
<?xml version="1.0" encoding="UTF-8" ?>
<myTag service="toto" version="1.5.18" >
<endOfPeriodTradeNotification version="1.5.18">
.............
</endOfPeriodTradeNotification>
<endOfPeriodTradeNotification version="1.5.18">
and that's not valid xml. Neither is
<inventoryDate>2009-12-31</inventoryDate>
<!-- reporting date -->
<processingDate>2010-01-29T00:00:00</processingDate>
</myTag>
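One possible way to take care of this is to repeat the root element in every part, so that each output file opens and closes it. The sketch below does that with the StAX event API from the JDK (javax.xml.stream), which is a different technique from the readLine loop above and is only one option among several. The names WellFormedSplitter, split and childrenPerPart are invented for the example. It treats every direct child of the root as one unit, drops comments and whitespace that sit directly under the root, and returns the parts as strings where a real splitter would write files.

```java
import java.io.Reader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: every part gets its own copy of the root start and end tag.
class WellFormedSplitter {
    static List<String> split(Reader in, int childrenPerPart) throws XMLStreamException {
        XMLInputFactory inputs = XMLInputFactory.newInstance();
        inputs.setProperty(XMLInputFactory.SUPPORT_DTD, false); // no DTD processing needed here
        XMLEventReader reader = inputs.createXMLEventReader(in);
        XMLOutputFactory outputs = XMLOutputFactory.newInstance();
        XMLEventFactory events = XMLEventFactory.newInstance();

        List<String> parts = new ArrayList<>();
        StartElement root = null;      // remembered so it can be replayed in each part
        StringWriter buffer = null;
        XMLEventWriter writer = null;  // null while no part is open
        int depth = 0;
        int childrenInPart = 0;

        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                depth++;
                if (depth == 1) {
                    root = event.asStartElement();
                } else if (depth == 2 && writer == null) {
                    buffer = new StringWriter();           // open a new part
                    writer = outputs.createXMLEventWriter(buffer);
                    writer.add(events.createStartDocument());
                    writer.add(root);
                }
            }
            if (depth >= 2) {
                writer.add(event);     // copy everything that belongs to a child of the root
            }
            if (event.isEndElement()) {
                depth--;
                if (depth == 1 && ++childrenInPart == childrenPerPart) {
                    parts.add(finish(writer, buffer, events, root));
                    writer = null;
                    childrenInPart = 0;
                }
            }
        }
        if (writer != null) {
            parts.add(finish(writer, buffer, events, root)); // last, possibly shorter part
        }
        return parts;
    }

    private static String finish(XMLEventWriter writer, StringWriter buffer,
                                 XMLEventFactory events, StartElement root) throws XMLStreamException {
        writer.add(events.createEndElement(root.getName().getPrefix(),
                root.getName().getNamespaceURI(), root.getName().getLocalPart()));
        writer.add(events.createEndDocument());
        writer.close();
        return buffer.toString();
    }
}
```

Counting children is used here instead of a size limit only to keep the example short; the same place in the loop could check the number of characters written so far.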
Upvotes: 0