Reputation: 54094
I have a function that does XML parsing. I want to make the function thread-safe, but also as fast as possible (with as little blocking as possible).
In short, the code looks something like this:
public Document doXML(InputStream s)
        throws ParserConfigurationException, SAXException, IOException
{
    //Some processing.
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    DocumentBuilder parser = factory.newDocumentBuilder();
    Document xmlDoc = parser.parse(s); // parse the stream passed in as "s"
    return xmlDoc;
}
But I do not want to create a new DocumentBuilderFactory or DocumentBuilder in each call.
I want to reuse the factory and the parser, but I am not sure they are thread-safe. So what is the best approach?
1) Cache a DocumentBuilderFactory in a class field and synchronize the call to factory.newDocumentBuilder(), so that each thread gets its own instance of DocumentBuilder
2) Cache both a DocumentBuilderFactory and a DocumentBuilder and synchronize the call to parser.parse(s), so that only one thread parses at a time
I think (2) is best, but is it safe to do it? Also, can I avoid the blocking caused by synchronized? I would like it to be as fast as possible.
Thanks.
Upvotes: 3
Views: 4368
Reputation: 13249
I ran into some performance problems in a similar situation. I was creating the factory objects on each use (tens of times per second) to avoid thread problems. The XML implementation on that (admittedly old) platform ran some relatively slow lookup logic to find a service-provider class.
My fix was to determine which factory class the lookup resolved to and set it explicitly via command-line system properties. That caused the lookup to be skipped.
-Djavax.xml.parsers.DocumentBuilderFactory=com.example.FactoryClassName
-Djavax.xml.transform.TransformerFactory=com.example.OtherFactoryClassName
The frustrating thing was that the lookup code had caching logic for the case where a class was found, but it did not cache a miss (nothing found, use the default). Slightly better lookup caching that handled the negative case would have made this workaround unnecessary.
Is this still needed? That requires testing in your environment. I used truss on Solaris to spot the very frequent file operations that resulted from the lookup logic.
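To find the class names to put in those properties, one option is to ask the factories which implementation they resolved to. This is only a sketch: the class name ShowXmlFactories and its helper methods are made up for illustration, and whether setting the properties actually saves time depends on your JDK and its lookup logic.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;

// Hypothetical helper: prints the factory classes that the lookup resolves to.
public class ShowXmlFactories {

    // Name of the concrete DocumentBuilderFactory chosen by the lookup.
    static String documentBuilderFactoryName() {
        return DocumentBuilderFactory.newInstance().getClass().getName();
    }

    // Name of the concrete TransformerFactory chosen by the lookup.
    static String transformerFactoryName() {
        return TransformerFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println("-Djavax.xml.parsers.DocumentBuilderFactory="
                + documentBuilderFactoryName());
        System.out.println("-Djavax.xml.transform.TransformerFactory="
                + transformerFactoryName());
    }
}
```

Run it once in the target environment and copy the printed lines onto the command line of the real application.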
Upvotes: 1
Reputation: 40266
If you are reusing threads (as in a thread pool), you can declare your DocumentBuilderFactory as a thread-local. There is the overhead of creating a new factory for each thread, but as I said, if the threads are reused, the subsequent overhead is very low.
final ThreadLocal<DocumentBuilderFactory> documentBuilderFactor = new ThreadLocal<DocumentBuilderFactory>() {
    @Override
    protected DocumentBuilderFactory initialValue() {
        return DocumentBuilderFactory.newInstance();
    }
};
public Document doXML(InputStream s)
        throws ParserConfigurationException, SAXException, IOException
{
    //Some processing.
    DocumentBuilderFactory factory = documentBuilderFactor.get(); // this thread's own factory
    DocumentBuilder parser = factory.newDocumentBuilder();
    Document xmlDoc = parser.parse(s);
    return xmlDoc;
}
Here you will only create one DocumentBuilderFactory for each thread.
I don't know whether DocumentBuilder is thread-safe when parsing (is it immutable?). Either way, you can apply the same thread-local mechanism to the DocumentBuilder itself, so that each thread reuses its own parser.
This approach needs no synchronization, so it should give the best overall throughput.
Note: This wasn't tested or compiled; it just gives an idea of what I am referring to.
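Taking the same idea one step further, the DocumentBuilder itself can be kept per thread. The following is a minimal sketch, not a tested production class: the name PerThreadXmlParser is made up, it assumes Java 8+ for ThreadLocal.withInitial, and it assumes the underlying parser implementation supports DocumentBuilder.reset() (the JDK's built-in one does, but other implementations may throw UnsupportedOperationException).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical class: one DocumentBuilder per thread, so no locking is needed.
public class PerThreadXmlParser {

    // Each thread lazily creates its own builder on first use.
    private static final ThreadLocal<DocumentBuilder> BUILDER =
            ThreadLocal.withInitial(() -> {
                try {
                    return DocumentBuilderFactory.newInstance().newDocumentBuilder();
                } catch (ParserConfigurationException e) {
                    throw new IllegalStateException(e);
                }
            });

    public static Document doXML(InputStream s) throws SAXException, IOException {
        DocumentBuilder parser = BUILDER.get();
        try {
            return parser.parse(s);
        } finally {
            // Restore the builder to its initial state before the next use.
            parser.reset();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = "<a><b/></a>".getBytes(StandardCharsets.UTF_8);
        Document doc = doXML(new ByteArrayInputStream(xml));
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

With a fixed-size thread pool, this creates at most one factory lookup and one builder per pool thread, and parsing never blocks on another thread.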
Upvotes: 5
Reputation: 18455
Option 2 would be thread-safe, but your app would only ever be able to parse one document at a time.
Why not just use the code you have? Does
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder parser = factory.newDocumentBuilder();
have a demonstrably unacceptable overhead?
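One rough way to answer that question is to time the two calls in a loop. The class below is a hypothetical sketch, not a proper benchmark: JIT warm-up and GC can skew the numbers, so treat the output as an order-of-magnitude figure and use a real benchmarking harness if the decision matters.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

// Hypothetical helper: crude timing of factory + builder creation.
public class FactoryOverheadTimer {

    // Average wall-clock nanoseconds for one newInstance() + newDocumentBuilder().
    static long averageNanosPerCreation(int iterations)
            throws ParserConfigurationException {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        }
        return (System.nanoTime() - start) / iterations;
    }

    public static void main(String[] args) throws Exception {
        averageNanosPerCreation(1_000); // warm-up pass, result discarded
        System.out.println("avg ns per creation: " + averageNanosPerCreation(10_000));
    }
}
```

Compare the result with the time it takes to parse a typical document; if creation is a small fraction of that, caching is unlikely to be worth the added complexity.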
Upvotes: 2
Reputation: 28703
If you want to avoid synchronized blocking, you should make sure you use atomic operations. The behavior of the javax.xml.parsers.* classes depends on the implementation (you can select the implementation using system properties, or call the implementing code directly). Depending on the number of threads and the load on each thread, it may be reasonable to control parser object creation. You have to choose between creating a new parser and waiting for an existing one. The code can create a pool of parsers at startup; threads then take parsers from the pool, which blocks when no parser is free. Once a thread has acquired a parser, it parses its data, resets the parser and puts it back into the pool. You can control the time/memory trade-off through the size of the pool.
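The pool described above could be sketched with a BlockingQueue, as follows. The class name DocumentBuilderPool and its methods are invented for this illustration, and it assumes the parser implementation supports DocumentBuilder.reset().

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical fixed-size pool of parsers; callers block when none is free.
public class DocumentBuilderPool {

    private final BlockingQueue<DocumentBuilder> pool;

    public DocumentBuilderPool(int size) throws ParserConfigurationException {
        pool = new ArrayBlockingQueue<>(size);
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        for (int i = 0; i < size; i++) {
            pool.add(factory.newDocumentBuilder());
        }
    }

    public Document parse(InputStream s)
            throws SAXException, IOException, InterruptedException {
        DocumentBuilder parser = pool.take(); // blocks until a parser is free
        try {
            return parser.parse(s);
        } finally {
            try {
                parser.reset(); // restore initial state for the next user
            } finally {
                pool.add(parser); // always hand the parser back
            }
        }
    }

    // Number of parsers currently idle in the pool.
    public int available() {
        return pool.size();
    }
}
```

A pool size close to the number of threads that parse concurrently keeps waiting rare, while a smaller pool trades some waiting for lower memory use.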
Upvotes: 1