Reputation: 137
I want to search for the largest XML file in a MarkLogic database from the MarkLogic query console using XQuery. I can retrieve the size (bytes) of a document in the database using the following XQuery:
xdmp:binary-size(xdmp:unquote(xdmp:quote($doc),(),"format-binary")/binary())
where $doc is the document whose size in bytes I want.
Upvotes: 1
Views: 511
Reputation: 137
I found the following query useful:
(
for $doc in cts:uri-match('*.xml')
order by string-length(fn:doc($doc)) descending
return $doc
)[position() = 1]
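The above query uses the string-length function to count the characters in each document. Because it counts characters rather than bytes, a multi-byte special character counts only once, so this is the measure to use when the documents contain special characters and you care about length in characters. Note that string-length applied to a document node measures its text content, not its markup.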
If you want the number of bytes, you can use xdmp:binary-size as follows:
(
for $doc in cts:uri-match('*.xml')
order by xdmp:binary-size(xdmp:unquote(xdmp:quote(fn:doc($doc)),(),"format-binary")/binary()) descending
return $doc
)[position() = 1]
Upvotes: 1
Reputation: 61
It is true that there is no index on document size to quickly find the largest ones. But there are some options to find large documents.
One is to run a batch job that scans for large documents, using the function above to compute the size. It is also a little simpler to use the serialized length, with XQuery string-length(xdmp:quote(doc($uri))) or JavaScript xdmp.quote(cts.doc("/my/uri/here")).length.
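As a rough illustration of the serialized-length approach, here is a minimal sketch. It assumes the URI lexicon is enabled (cts:uri-match needs it) and that the documents of interest have URIs ending in .xml; both are assumptions carried over from the earlier answer, not requirements of the technique.

```xquery
xquery version "1.0-ml";

(: Sketch: rank documents by serialized length (characters, markup included). :)
(: Assumes the URI lexicon is enabled and the URIs end in ".xml".             :)
(
  for $uri in cts:uri-match("*.xml")
  (: xdmp:quote serializes the document, so markup counts toward the length :)
  let $size := fn:string-length(xdmp:quote(fn:doc($uri)))
  order by $size descending
  return fn:concat($size, " ", $uri)
)[1]
```

This still reads every matching document, so on a large database it is likely to be slow or to time out in Query Console.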
Corb, NiFi, or functions spawned on the task server via xdmp.spawnFunction() can run a big job like that over a period of time: check each document's size and store a record or log an indicator if it exceeds some size limit. You would then search or grep for the largest size.
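A minimal sketch of the task-server variant, using the XQuery counterpart xdmp:spawn-function. The threshold, the batch size, and the "LARGE-DOC" log prefix are hypothetical values chosen for illustration, and the URI pattern again assumes the URI lexicon is enabled. URIs are grouped into batches so that one task is queued per batch rather than per document.

```xquery
xquery version "1.0-ml";

let $limit := 1000000     (: hypothetical threshold, in characters :)
let $batch-size := 500    (: hypothetical number of URIs per task  :)
let $uris := cts:uri-match("*.xml")
let $batches := xs:integer(fn:ceiling(fn:count($uris) div $batch-size))
for $i in 1 to $batches
let $batch := fn:subsequence($uris, ($i - 1) * $batch-size + 1, $batch-size)
return
  (: each spawned function runs later on the task server :)
  xdmp:spawn-function(function() {
    for $uri in $batch
    let $size := fn:string-length(xdmp:quote(fn:doc($uri)))
    where $size gt $limit
    (: "LARGE-DOC" is a made-up marker to grep for in the error log :)
    return xdmp:log(fn:concat("LARGE-DOC ", $size, " ", $uri))
  })
```

Once the tasks have finished, grepping the error log for the marker and sorting by the size field should surface the largest documents.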
Sometimes, if you know the structure and some common terms that will be in a larger document, you can search for documents that contain a common "word" or "term" many times using cts.wordQuery("theCommonTerm") and the option "min-occurs=number". You need to adjust the min-occurs number to narrow things down to the largest documents, then run your size query just on those.
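A sketch of that narrowing step in XQuery, assuming "theCommonTerm" stands in for a term you know recurs in large documents and that 50 is a placeholder for the occurrence count you tune. It uses cts:search, which filters results by default, so the min-occurs condition is checked against the actual document content.

```xquery
xquery version "1.0-ml";

(: Sketch: size only the documents that contain the term many times.     :)
(: "theCommonTerm" and the count 50 are placeholders to adjust.          :)
(
  for $doc in cts:search(fn:doc(),
                 cts:word-query("theCommonTerm", "min-occurs=50"))
  let $size := fn:string-length(xdmp:quote($doc))
  order by $size descending
  return fn:concat($size, " ", xdmp:node-uri($doc))
)[1]
```

Raising the min-occurs value shrinks the candidate set; if nothing matches, lower it and rerun.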
Upvotes: 0