Bomberlatinos9

Reputation: 810

Producer/Consumer pattern the Java way

I have to solve a problem that amounts to parsing a huge file, 3 GB or larger. The file is structured as pseudo-XML, like this:

<docFileNo_1>
<otherItems></otherItems>
<html>
<div=XXXpostag>
</html>

</docFileNo>
   ... other docs ...
<docFileNo_N>
<otherItems></otherItems>

<html>
<div=XXXpostag>
</html>

</docFileNo>

.......

In a recent post, "http://stackoverflow.com/questions/4355107/parsing-a-big-big-not-well-formed-file-with-java", I found an interesting solution to my problem. So I have been thinking about making my parser multithreaded. The work consists of two steps:

  1. After collecting the text between the tags, up to </html>, in a StringBuilder, I return the StringBuilder.
  2. From the returned StringBuilder, I extract the text content of the HTML page using CSS rules, with the HTML parser jsoup (http://jsoup.org/). After extracting the content of the HTML page, I must save it to a file.

So, focusing on steps 1) and 2), I am thinking of replacing the sequential approach with a multithreaded one, like this:

  1. While reading a chunk of the file (line by line, from .. up to </html>), I append each line to a StringBuilder.
  2. For each StringBuilder, I create a thread that runs code to 2.1 parse the HTML and extract the text content, and 2.2 save the text content to a file.
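
A minimal sketch of the reading step, assuming (as in the sample above) that the <html> and </html> tags each sit on their own line. The class name `HtmlChunker` and the method `readHtmlBlocks` are hypothetical names chosen for illustration, and the sketch returns every block in a list only to keep it short; with a 3 GB file each block would be handed off as soon as it is complete.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class HtmlChunker {

    // Collects the lines from each <html> line through the matching </html> line
    // into one string per document. Lines outside an <html> block are skipped.
    static List<String> readHtmlBlocks(BufferedReader reader) {
        List<String> blocks = new ArrayList<>();
        StringBuilder current = null; // non-null only while inside an <html> block
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                if (current == null && trimmed.startsWith("<html")) {
                    current = new StringBuilder();
                }
                if (current != null) {
                    current.append(line).append('\n');
                    if (trimmed.endsWith("</html>")) {
                        blocks.add(current.toString());
                        current = null;
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return blocks;
    }
}
```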

So I have some doubts:

  1. How many threads must I create? Do I have to create one thread for every StringBuilder? Won't that lead to memory problems?
  2. How can I find out the exact number of threads that completed successfully?
  3. How can I know how many threads have finished? Do I have to wait for all threads to finish before terminating my work?

About my doubts: for point 1, I really don't know how to resolve it. For point 2, I think I could implement the threads as an inner class of the class that parses the file, so that I can have a static counter incremented by every thread that has finished. Point 3 seems similar to point 2, but I don't know how to make my application wait.

Could someone suggest something to resolve my doubts? Thanks :)

Upvotes: 2

Views: 971

Answers (1)

Peter Lawrey

Reputation: 533530

If you have a decent, efficient parser, it should be able to parse the data as fast as you can read it. I suggest you make sure this is the case, and then you will be able to use one thread (possibly with a separate one to do the reading).

3 GB isn't that huge. You should be able to read/parse it in under a minute. Much of that time will be spent just reading the file off disk. The real cost is likely to be in what you do with the parsed information, and that is the part worth passing on to one or more additional threads.
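
One way to hand the processing to additional threads is a small, bounded thread pool rather than one thread per document. The sketch below is only an illustration under assumptions: `DocumentPool`, `processAll` and `extractText` are hypothetical names, `extractText` is a trivial stand-in for the real jsoup extraction and file write, and the queue size of 64 is arbitrary. The bounded queue with `CallerRunsPolicy` makes the submitting thread do the work itself when the queue is full, which keeps memory use in check; `awaitTermination` waits for all tasks, and an `AtomicInteger` counts the ones that completed without an exception.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class DocumentPool {

    // Hypothetical stand-in for the real work (HTML extraction plus saving to a file).
    static String extractText(String html) {
        return html.replaceAll("<[^>]*>", "").trim();
    }

    // Runs extractText on every document using a fixed number of workers and
    // returns how many tasks completed without throwing.
    static int processAll(List<String> documents, int workers) {
        AtomicInteger succeeded = new AtomicInteger();
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                workers, workers, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(64),                // at most 64 documents waiting
                new ThreadPoolExecutor.CallerRunsPolicy());  // full queue: caller runs the task
        for (String doc : documents) {
            pool.execute(() -> {
                try {
                    extractText(doc);
                    succeeded.incrementAndGet();
                } catch (RuntimeException e) {
                    // A failed document is simply not counted in this sketch.
                }
            });
        }
        pool.shutdown(); // accept no new tasks; queued tasks still run
        try {
            if (!pool.awaitTermination(1, TimeUnit.HOURS)) {
                pool.shutdownNow();
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
        }
        return succeeded.get();
    }
}
```

A reasonable starting point for `workers` is `Runtime.getRuntime().availableProcessors()`, adjusted after measuring.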

To pass data between two threads (one for reading, one for processing) you can use either an Exchanger or PipedOutputStream/PipedInputStream. The Exchanger is more efficient, but the piped stream is easier to integrate with a parser.
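
A minimal sketch of the piped-stream variant, under assumptions: the class name `PipedHandoff` and the method `transfer` are hypothetical, the producer writes lines it is given instead of reading a real file, and the 64 KB pipe size is arbitrary. The producer thread writes into the `PipedOutputStream`; the consumer reads the connected `PipedInputStream` like any other stream, which is why it plugs into a parser easily. Closing the output side is what signals end of input to the reader.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class PipedHandoff {

    // Sends the lines through a pipe from a producer thread and collects them
    // on the calling thread, which plays the role of the parser.
    static List<String> transfer(List<String> lines) {
        List<String> received = new ArrayList<>();
        try {
            PipedOutputStream out = new PipedOutputStream();
            PipedInputStream in = new PipedInputStream(out, 64 * 1024);

            Thread producer = new Thread(() -> {
                // try-with-resources closes the pipe, which the reader sees as end of stream
                try (PipedOutputStream sink = out) {
                    for (String line : lines) {
                        sink.write((line + "\n").getBytes(StandardCharsets.UTF_8));
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            producer.start();

            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(in, StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    received.add(line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return received;
    }
}
```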

Upvotes: 1
