Engineer

Reputation: 1

XML parsing and writing txt files using multiple threads in Java

I have many XML files. Each file contains a large number of lines and tags. I need to parse each one and write a .txt file named after the XML file. This needs to be done quickly; the faster, the better.

Example of an XML file:

<text>
   <paragraph>
         <line>
             <character>g</character>
             <character>o</character>
                         .....
          </line>
          <line>
             <character>k</character>
                         .....
          </line>
   </paragraph>
</text>
<text>
   <paragraph>
         <line>
             <character>c</character>
                         .....
          </line>
   </paragraph>
</text>

Example of the resulting text file:

go..
k..

c..

How can I parse many XML files and write many text files using multiple threads in Java, as fast as possible?

Where should I start? Does the parsing method I use affect speed? If so, which method is faster than the others?

I have no experience with multithreading. How should I structure the threads to be effective?

Any help is appreciated. Thanks in advance.

EDIT

I need some help. I used SAX for parsing. I did some research on thread pools, multithreading, and Java 8 features. I tried some code blocks, but there was no change in total time. How can I add a multi-threaded structure or Java 8 features (lambda expressions, parallelism, etc.) to my code?

Upvotes: 0

Views: 1137

Answers (3)

Michael Kay

Reputation: 163458

If you write your code in XSLT (2.0 or later), using the collection() function to parse your source files, and the xsl:result-document instruction to write your result files, then you will be able to assess the effect of multi-threading simply by running the code under Saxon-EE, which applies multi-threading to these constructs automatically. Usually in my experience this gives a speed-up of around a factor of 3 for such programs.

This is one of the benefits of using functional declarative languages: because there is no mutable state, multi-threading is painless.

LATER

I'll add an answer to your supplementary question about using DOM or SAX. From what we can see, the output file is a concatenation of the <character> elements in the input, so if you wrote it in XSLT 3.0 it would be something like this:

<xsl:mode on-no-match="shallow-skip"/>
<xsl:template match="character">
  <xsl:value-of select="."/>
</xsl:template>

If that's the case then there's certainly no need to build a tree representation of each input document, and coding it in SAX would be reasonably easy. Or if you follow my suggestion of using Saxon-EE, you could make the transformation streamable to avoid the tree building. Whether this is useful, however, really depends on how big the source documents are. You haven't given us any numbers to work with, so giving concrete advice on performance is almost impossible.

If you are going to use a tree-based representation, then DOM is the worst one you could choose. It's one of those cases where there are half-a-dozen better alternatives but because they are only 20% better, most of the world still uses DOM, perceiving it to be more "standard". I would choose XOM or JDOM2.

If you're prepared to spend an unlimited amount of time coding this in order to get the last ounce of execution speed, then SAX is the way to go. For most projects, however, programmers are expensive and computers are cheap, so this is the wrong trade-off.
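For comparison, here is a minimal sketch of what the SAX version might look like. The class name CharacterExtractor and the extract method are hypothetical, and the sketch assumes each file is well-formed with a single root element wrapping the <text> elements (the sample in the question shows several top-level <text> elements, which a conforming parser would reject). It emits a newline after each <line> and a blank line after each <text>, which is my reading of the sample output rather than a stated requirement.

```java
import java.io.IOException;
import java.io.InputStream;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text of every <character> element.
final class CharacterExtractor extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();
    private boolean inCharacter;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("character".equals(qName)) {
            inCharacter = true;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text outside <character> (indentation, filler) is ignored.
        if (inCharacter) {
            out.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("character".equals(qName)) {
            inCharacter = false;
        } else if ("line".equals(qName) || "text".equals(qName)) {
            // Assumption: newline per <line>, extra blank line per <text>.
            out.append('\n');
        }
    }

    /** Parses one document and returns the extracted text. */
    static String extract(InputStream in) {
        CharacterExtractor handler = new CharacterExtractor();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Unchecked so the method can be called from lambdas and pool tasks.
            throw new IllegalStateException("could not parse XML", e);
        }
        return handler.out.toString();
    }
}
```

Because the handler keeps all of its state in instance fields and extract creates a new handler per call, separate files can be processed on separate threads without sharing anything.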

Upvotes: 0

OldCurmudgeon

Reputation: 65851

Points to note in this situation.

  1. In many cases, attempting to write to multiple files at once using multi-threading is utterly pointless. All this generally does is exercise the disk heads more than necessary.
  2. Writing to disk while parsing is also likely to be a bottleneck. You would be better off parsing the XML into a buffer and then writing the whole buffer to disk in one hit.
  3. The speed of your parser is unlikely to affect the overall time for the process significantly. Your system will almost certainly spend much more time reading and writing than parsing.
  4. A quick check with some real test data would be invaluable. Try to get a good estimate of the amount of time you will not be able to affect.
    • Determine an approximate total read time by reading a few thousand sample files into memory, because that time will still be spent however parallel you make the process.
    • Estimate an approximate total write time in a similar way.
    • Add the two together and compare that with your total execution time for reading, parsing and writing those same files. This should give you a good idea how much time you might save through parallelism.
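
The read-time measurement in point 4 could be sketched roughly as follows. The class IoBaseline and its method names are hypothetical, and this is only a rough yardstick: the operating system's file cache can make a second pass over the same files much faster than the first, so measure on files that have not been read recently.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper for estimating the I/O time that parallelism cannot remove.
final class IoBaseline {

    /** Reads every file fully into memory and returns the total number of bytes read. */
    static long readAll(List<Path> files) {
        long total = 0;
        for (Path file : files) {
            try {
                total += Files.readAllBytes(file).length;
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return total;
    }

    /** Runs the task once and returns the elapsed wall-clock time in milliseconds. */
    static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```

Calling IoBaseline.timeMillis(() -> IoBaseline.readAll(sampleFiles)) on a representative sample gives the read-only figure; timing the full read, parse and write run over the same files with the same helper gives the number to compare it against.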

Parallelism is not always an answer to slow-running processes. You can often significantly improve throughput just by using appropriate hardware.

Upvotes: 3

The Beruriah Incident

Reputation: 3257

First, are you sure you need this to be faster or multithreaded? Premature optimization is the root of all evil. You can easily make your program much more complicated for unimportant gain if you aren't careful, and multithreading can for sure make things much more complicated.

However, toward the actual question: start out by solving this in a single-threaded way. Then think about how you want to split the problem across many threads (e.g. have a pool of XML files and a pool of threads, where each thread grabs an XML file whenever it is free, until the pool of files is empty). Report back with wherever you get stuck in this process.
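
As a starting point, the pool idea can be sketched with java.util.concurrent. Everything named here (ParallelConverter, convertAll, the converter parameter) is hypothetical; the converter stands in for whatever single-threaded parse routine you already have, taking an XML path and returning the text to write. Treat it as a sketch to experiment with, not a tuned implementation.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class ParallelConverter {

    /** Converts each XML file on a fixed-size pool; returns the number of files converted. */
    static int convertAll(List<Path> xmlFiles, Path outDir, int threads,
                          Function<Path, String> converter) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // One task per file; idle threads pick up the next queued task.
            List<Future<?>> pending = new ArrayList<>();
            for (Path xml : xmlFiles) {
                pending.add(pool.submit(() -> convertOne(xml, outDir, converter)));
            }
            int done = 0;
            for (Future<?> task : pending) {
                task.get(); // waits for the task and surfaces any failure
                done++;
            }
            return done;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while converting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a conversion task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    // Builds the whole output in memory, then writes it with a single call.
    private static void convertOne(Path xml, Path outDir, Function<Path, String> converter) {
        String text = converter.apply(xml);
        String name = xml.getFileName().toString().replaceFirst("\\.xml$", "") + ".txt";
        try {
            Files.write(outDir.resolve(name), text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A reasonable first value for threads is Runtime.getRuntime().availableProcessors(). If the total time does not change when you vary it, the job is probably limited by disk I/O rather than by parsing, and more threads will not help.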

The method that you use to parse will affect speed, as different parsing libraries have different behavior characteristics. But again, are you sure you need the absolute fastest?

Upvotes: 0
