Reputation: 2620
I am using MarkLogic 9.
My MarkLogic database contains 2.8 million XML documents, and I want a list of all the unique element names.
Given the size of the database, what is the best and fastest way to get them?
Upvotes: 1
Views: 125
Reputation: 66783
You could run a CORB job. The URIs module selects all of the URIs from the database, and the process module returns a distinct list of element names for each document using either name() or local-name(). Set PROCESS-TASK=com.marklogic.developer.corb.ExportBatchToFileTask to write all of the output to a single file, then set POST-BATCH-TASK=com.marklogic.developer.corb.PostBatchUpdateFileTask and EXPORT-FILE-SORT=ascending|distinct to sort and dedup that file. The result is a text file containing a distinct list of element names from the database.
An example job with all of the necessary options, except for the XCC-CONNECTION-URI:
# Inline module to select all URIs
URIS-MODULE=INLINE-XQUERY|xdmp:estimate(fn:doc()), cts:uris("",(),cts:true-query())
# Inline module to return the distinct element names in the document, each on a separate line
PROCESS-MODULE=INLINE-XQUERY|declare variable $URI as xs:string external; string-join(fn:distinct-values(fn:doc($URI)//*/name()),"&#10;")
# Write the results of each process module to a single file
PROCESS-TASK=com.marklogic.developer.corb.ExportBatchToFileTask
EXPORT-FILE-NAME=element-names.txt
# After the batch processing is completed, sort and dedup the element names
POST-BATCH-TASK=com.marklogic.developer.corb.PostBatchUpdateFileTask
EXPORT-FILE-SORT=ascending|distinct
THREAD-COUNT=10
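To make the two stages of the job easier to follow, here is a minimal local sketch in Python. It is not part of CoRB or MarkLogic: the function names element_names and merge_sorted_distinct and the two sample documents are hypothetical, and the namespace stripping only approximates local-name(). It mirrors the idea of collecting distinct names per document, then sorting and deduplicating across all documents.

```python
import xml.etree.ElementTree as ET


def element_names(xml_text):
    """Return the distinct element names in one document (the per-URI step)."""
    root = ET.fromstring(xml_text)
    # root.iter() walks the root and all of its descendants, similar to //*.
    # ElementTree writes namespaced tags as "{uri}local", so keep only the
    # part after "}" to get something comparable to local-name().
    return {el.tag.rsplit("}", 1)[-1] for el in root.iter()}


def merge_sorted_distinct(per_document_names):
    """Union the per-document sets and sort them (the post-batch step)."""
    merged = set()
    for names in per_document_names:
        merged |= names
    return sorted(merged)


# Hypothetical sample documents standing in for database content.
docs = [
    "<order><id>1</id><item><sku>A</sku></item></order>",
    "<order><id>2</id><note>rush</note></order>",
]

result = merge_sorted_distinct(element_names(d) for d in docs)
print("\n".join(result))
```

The per-document step can run in parallel because each document is independent, which is what THREAD-COUNT exploits in the job above; only the final sort and dedup needs to see all of the output.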
Upvotes: 2