Reputation: 810
I have to solve a problem that boils down to parsing a huge file, 3 GB or larger. The file is structured like a pseudo-XML file:
<docFileNo_1>
<otherItems></otherItems>
<html>
<div=XXXpostag>
</html>
</docFileNo>
... others doc...
<docFileNo_N>
<otherItems></otherItems>
<html>
<div=XXXpostag>
</html>
</docFileNo>
.......
In a recent post, "http://stackoverflow.com/questions/4355107/parsing-a-big-big-not-well-formed-file-with-java", I found an interesting solution to my problem, so I thought of implementing my application's parser in a multithreaded way:
So, focusing on steps 1) and 2), I am thinking of replacing the sequential pattern with a multithreaded approach:
So I have some doubts.
About my doubts: for point 1, I don't really know how to solve it. For point 2, I think I could implement the threads as inner classes of the class that parses the file, so I can have a static counter incremented by each thread that finishes. Point 3 seems similar to point 2, but I don't know how to make my application wait.
Could someone suggest something to resolve my doubts? Thanks :)
Upvotes: 2
Views: 971
Reputation: 533530
If you have a decent, efficient parser, it should be able to parse the data as fast as you can read it. I suggest you make sure this is the case; then you will be able to use one thread (possibly with a separate thread to do the reading).
3 GB isn't that huge. You should be able to read/parse it in under a minute, and much of that time will be just reading the file off disk. The real cost is likely to be in what you do with the parsed information, and that is the part worth passing on to one or more additional threads.
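One standard Java way to pass parsed records on to another thread (not spelled out in this answer, but a common pattern) is a bounded BlockingQueue between the parser and a worker. The class name, the record type, and the poison-pill marker below are all illustrative assumptions; a minimal sketch:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch: the parsing thread hands each parsed record to a worker thread
// through a bounded queue, so slow processing cannot exhaust memory.
public class QueueHandoff {
    // Hypothetical end-of-input marker ("poison pill"), not from the answer.
    private static final String POISON = "__END__";

    static List<String> run(List<String> records) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);
        List<String> processed = new ArrayList<>();

        Thread worker = new Thread(() -> {
            try {
                String record;
                while (!(record = queue.take()).equals(POISON)) {
                    processed.add(record.toUpperCase()); // stand-in for real processing
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();

        for (String r : records) {
            queue.put(r); // the parser thread enqueues records as it parses them
        }
        queue.put(POISON); // signal the worker that input is finished
        worker.join();     // join() also makes the worker's writes visible here
        return processed;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(List.of("docFileNo_1", "docFileNo_2")));
        // → [DOCFILENO_1, DOCFILENO_2]
    }
}
```

The bounded queue (here 1024 entries) is what provides back-pressure: if the workers fall behind, `put` blocks the parser instead of letting memory grow.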
To chain data between two threads (one for reading, one for processing) you can use either an Exchanger or a PipedOutputStream/PipedInputStream pair. The Exchanger is more efficient, but the piped streams are easier to integrate with a parser.
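The piped-stream option might look like the sketch below: one thread stands in for the file reader and writes bytes into a `PipedOutputStream`, while the other side reads them back from the connected `PipedInputStream` exactly as a parser would read from a `FileInputStream`. The class name, the 64 KB buffer size, and feeding from a byte array instead of a real 3 GB file are assumptions made for a self-contained example:

```java
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.nio.charset.StandardCharsets;

// Sketch: a reader thread pipes bytes to a consumer thread via piped streams.
public class PipedHandoff {

    static String readAll(byte[] data) throws IOException, InterruptedException {
        PipedOutputStream out = new PipedOutputStream();
        // Connect the input side to the output side with a 64 KB pipe buffer.
        PipedInputStream in = new PipedInputStream(out, 64 * 1024);

        Thread reader = new Thread(() -> {
            try (out) {
                out.write(data); // a real reader would loop over file chunks here
            } catch (IOException e) {
                e.printStackTrace();
            }
        });
        reader.start();

        // The "parser" side: consume the stream as if it were the file itself.
        byte[] buf = in.readAllBytes();
        reader.join();
        return new String(buf, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(readAll("<docFileNo_1>...</docFileNo>"
                .getBytes(StandardCharsets.UTF_8)));
        // → <docFileNo_1>...</docFileNo>
    }
}
```

Because the parser just sees an `InputStream`, this is the "easier to integrate with a parser" option the answer mentions: no parser code changes, only the stream it is handed.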
Upvotes: 1