Reputation: 1
I have many XML files. Each file contains a large number of lines and tags. I need to parse each one and write a .txt file named after the XML file. This needs to be done quickly; the faster the better.
Example of an XML file:
<text>
<paragraph>
<line>
<character>g</character>
<character>o</character>
.....
</line>
<line>
<character>k</character>
.....
</line>
</paragraph>
</text>
<text>
<paragraph>
<line>
<character>c</character>
.....
</line>
</paragraph>
</text>
Example of the resulting text file:
go..
k..
c..
How can I parse many XML files and write many text files using multiple threads in Java, as fast as possible?
Where should I start? Does the parsing method I use affect speed? If so, which method is faster than the others?
I have no experience with multithreading. How should I structure the threads so that they are used effectively?
Any help is appreciated. Thanks in advance.
EDIT
I need some help. I used SAX for parsing. I did some research on thread pools, multithreading, and Java 8 features. I tried some code blocks, but there was no change in total time. How can I add a multithreaded structure or Java 8 features (lambda expressions, parallelism, etc.) to my code?
Upvotes: 0
Views: 1137
Reputation: 163458
If you write your code in XSLT (2.0 or later), using the collection() function to parse your source files, and the xsl:result-document instruction to write your result files, then you will be able to assess the effect of multi-threading simply by running the code under Saxon-EE, which applies multi-threading to these constructs automatically. Usually in my experience this gives a speed-up of around a factor of 3 for such programs.
This is one of the benefits of using functional declarative languages: because there is no mutable state, multi-threading is painless.
LATER
I'll add an answer to your supplementary question about using DOM or SAX. From what we can see, the output file is a concatenation of the <character> elements in the input, so if you wrote it in XSLT 3.0 it would be something like this:
<xsl:mode on-no-match="shallow-skip"/>
<xsl:template match="character">
<xsl:value-of select="."/>
</xsl:template>
If that's the case then there's certainly no need to build a tree representation of each input document, and coding it in SAX would be reasonably easy. Or if you follow my suggestion of using Saxon-EE, you could make the transformation streamable to avoid the tree building. Whether this is useful, however, really depends on how big the source documents are. You haven't given us any numbers to work with, so giving concrete advice on performance is almost impossible.
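As a rough illustration of the SAX route, here is a minimal sketch of a handler that concatenates the text of each <character> element and starts a new output line at the end of each <line> element. The class name CharacterTextExtractor and the method extract are made up for this example. It assumes each file is well-formed XML with a single root element; the sample in the question shows several <text> elements at the top level, and a SAX parser would reject a file like that unless it is wrapped in one root.

```java
import java.io.InputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects <character> text, one output line per <line> element.
class CharacterTextExtractor extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();
    private boolean inCharacter;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        // The default SAXParserFactory is not namespace-aware, so qName holds the tag name.
        if ("character".equals(qName)) {
            inCharacter = true;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Keep text only while inside <character>; whitespace between tags is ignored.
        if (inCharacter) {
            out.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("character".equals(qName)) {
            inCharacter = false;
        } else if ("line".equals(qName)) {
            out.append('\n'); // each <line> becomes one line of text
        }
    }

    // Parses one document and returns the extracted text. No tree is built.
    static String extract(InputStream xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        CharacterTextExtractor handler = new CharacterTextExtractor();
        parser.parse(xml, handler);
        return handler.out.toString();
    }
}
```

Calling CharacterTextExtractor.extract(Files.newInputStream(path)) would return the text for one file. A new handler and parser are created per call, so nothing is shared between calls.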
If you are going to use a tree-based representation, then DOM is the worst one you could choose. It's one of those cases where there are half-a-dozen better alternatives but because they are only 20% better, most of the world still uses DOM, perceiving it to be more "standard". I would choose XOM or JDOM2.
If you're prepared to spend an unlimited amount of time coding this in order to get the last ounce of execution speed, then SAX is the way to go. For most projects, however, programmers are expensive and computers are cheap, so this is the wrong trade-off.
Upvotes: 0
Reputation: 65851
Points to note in this situation.
Parallelism is not always an answer to slow-running processes. You can often significantly improve throughput just by using appropriate hardware.
Upvotes: 3
Reputation: 3257
First, are you sure you need this to be faster or multithreaded? Premature optimization is the root of all evil. You can easily make your program much more complicated for unimportant gain if you aren't careful, and multithreading can for sure make things much more complicated.
However, toward the actual question: start out by solving this in a single-threaded way. Then think about how you want to split the problem across many threads (e.g. have a pool of XML files and a pool of threads, where each thread grabs an XML file whenever it's free, until the pool of files is empty). Report back with wherever you get stuck in this process.
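The pool idea above could be sketched roughly as follows, using a fixed-size ExecutorService from java.util.concurrent. The names XmlBatchConverter, Converter, and convertAll are invented for this illustration, and the per-file conversion is passed in so that whatever single-threaded parsing code you already have can be plugged in unchanged. Treat the thread count as something to measure rather than assume: if the job is limited by disk I/O, adding threads may not reduce the total time at all.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class XmlBatchConverter {
    // Hypothetical hook: turns one XML file into the text to be written.
    interface Converter {
        String convert(Path xmlFile) throws Exception;
    }

    // Converts every *.xml file in inputDir to a same-named .txt file in outputDir.
    // Returns the number of files processed.
    static int convertAll(Path inputDir, Path outputDir, Converter converter, int threads)
            throws Exception {
        Files.createDirectories(outputDir);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<?>> jobs = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(inputDir, "*.xml")) {
            for (Path xml : files) {
                // One task per file; an idle worker thread picks up the next queued task.
                jobs.add(pool.submit(() -> {
                    String name = xml.getFileName().toString().replaceFirst("\\.xml$", ".txt");
                    byte[] text = converter.convert(xml).getBytes(StandardCharsets.UTF_8);
                    Files.write(outputDir.resolve(name), text);
                    return null;
                }));
            }
        } finally {
            pool.shutdown(); // accept no new tasks; queued tasks still run to completion
        }
        for (Future<?> job : jobs) {
            job.get(); // waits for the task and rethrows any failure from it
        }
        return jobs.size();
    }
}
```

Each task touches only its own input and output file, so no locking is needed. A reasonable starting point for the threads argument is Runtime.getRuntime().availableProcessors(), then time the run against the single-threaded version.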
The method that you use to parse will affect speed, as different parsing libraries have different behavior characteristics. But again, are you sure you need the absolute fastest?
Upvotes: 0