Reputation: 440
I'm looking for an example of how to use Saxon's discard-document function. I have about 50 files of 40 MB each, and my XQuery script uses about 4.5 GB of memory to process them.
I tried calling saxon:discard-document(doc("filename.xml")) after every access to the XML file, but memory usage did not change. Maybe this is not the correct way to do it?
I also found some older questions about its usage (from 7 years ago), which suggested running the XPath directly on the result of discard-document. But I have many calls to that document, so I would have to replace every reference with saxon:discard-document(doc("filename.xml"))/xpath/etc/etc/etc
Thanks
Upvotes: 1
Views: 232
Reputation: 440
I think this is a good question and there is not much information available, so I will try to answer it myself.
Here is an example of how to use saxon:discard-document:
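declare namespace saxon = "http://saxon.sf.net/";

(: Runs whatever queries are needed against a single document. :)
declare function local:doStuffInDocument($doc as document-node()) {
  $doc//testPath
};

let $urls := ("http://url1", "http://url2")
let $results :=
  for $url in $urls
  (: Load the document and remove it from Saxon's document cache. :)
  let $doc := saxon:discard-document(doc($url))
  return local:doStuffInDocument($doc)
return $results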
Using similar code, I managed to reduce memory consumption from over 4 GB to only 300 MB.
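One caveat worth sketching: a node returned from a document still references its containing tree, so if the result sequence keeps nodes around, the documents they came from may stay reachable. The variant below is a minimal sketch that returns only string values instead. The function name local:extractValues is hypothetical, and the URLs and the testPath element are the same placeholders as in the example above.

```xquery
declare namespace saxon = "http://saxon.sf.net/";

(: Hypothetical helper: returns atomic values only, so the result
   carries no reference back to nodes of the source document. :)
declare function local:extractValues($doc as document-node()) as xs:string* {
  $doc//testPath/string()
};

let $urls := ("http://url1", "http://url2")
for $url in $urls
(: The document is only referenced inside this iteration. :)
return local:extractValues(saxon:discard-document(doc($url)))
```

Whether this makes a measurable difference depends on how much of each document the results would otherwise keep alive.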
To understand what discard-document does, here is a great comment from Michael Kay, found on the SF mailing list:
Just to explain what discard-document() does:
Saxon maintains (owned by the Transformer/Controller) a table that maps document URIs to document nodes. When you call the document() function, Saxon looks to see if the URI is in this table, and if it is, it returns the corresponding document node. If it isn't, it reads and parses the resource found at that URI. The effect of saxon:discard-document() is to remove the entry for a document from this mapping table. (Of course, if a document is referenced from this table then the garbage collector will hold the document in memory; if it is not referenced from the table then it becomes eligible for garbage collection. It won't be garbage collected if it's referenced from a global variable; but it will still be absent from the table in the event that another call on document() uses the same URI again.)
And another one from Michael Kay, found on the Altova mailing list:
In Saxon, if you use the doc() or document() function, then the file will be loaded into memory, and will stay in memory until the end of the run, just in case it's referenced again. So you will hit the same memory problem with lots of small files as with one large file - worse, in fact, since there is a significant per-document overhead.
However, there's a workaround: an extension function saxon:discard-document() that causes a document to be discarded from memory by the garbage collector as soon as there are no more references to it.
Upvotes: 2
Reputation: 163595
It's probably useful to understand what actually happens below the covers. The doc() function looks in a cache to see if the document is already there; if not, it reads the document, adds it to the cache, and then returns it. The discard-document() function looks to see if the document is in the cache, removes it if it is, and then returns it. By removing the document from the cache, it makes it eligible for garbage collection when the document is no longer referenced. If using discard-document has no effect on memory consumption, that's probably because there is something else still referencing the document - for example, a global variable.
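As a rough illustration of the global-variable case, the sketch below keeps the only reference to the document inside a function body. The function name local:countItems and the element name item are hypothetical; filename.xml is the placeholder from the question.

```xquery
declare namespace saxon = "http://saxon.sf.net/";

(: A global variable such as the following would keep the document
   referenced for the whole query, so removing it from the cache
   would free nothing:
   declare variable $doc := doc("filename.xml");
:)

(: Hypothetical helper: the document is bound to a local variable only,
   and the function returns an atomic value rather than nodes. :)
declare function local:countItems($uri as xs:string) as xs:integer {
  let $doc := saxon:discard-document(doc($uri))
  return count($doc//item)
};

local:countItems("filename.xml")
```

Once the function returns, nothing in the query refers to the document any more, so it can be garbage collected.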
Upvotes: 1