Dixit Singla

Reputation: 2620

How to get unique element names from XMLs stored in MarkLogic DB, when DB size is too big?

I am using MarkLogic 9.

My MarkLogic database contains 2.8 million XML documents. I just want to get all the unique element names.

Since the database is so large, what is the best and fastest way to get the unique element names?

Upvotes: 1

Views: 125

Answers (1)

Mads Hansen

Reputation: 66783

You could run a CORB job. The URIs module selects all of the URIs from the database, and the process module returns the distinct element names in each document using either name() (which keeps any namespace prefix) or local-name() (which drops it). Set PROCESS-TASK=com.marklogic.developer.corb.ExportBatchToFileTask to write all of the output to a single file, and set POST-BATCH-TASK=com.marklogic.developer.corb.PostBatchUpdateFileTask together with EXPORT-FILE-SORT=ascending|distinct to sort and dedup that file once the batch completes. The result is a text file containing a distinct list of the element names in the database.

An example job with all of the necessary options, except for the XCC-CONNECTION-URI:

# Inline module to select all URIs
URIS-MODULE=INLINE-XQUERY|xdmp:estimate(fn:doc()), cts:uris("",(),cts:true-query())

# Inline module to return the distinct element names in the document, one per line
PROCESS-MODULE=INLINE-XQUERY|declare variable $URI as xs:string external; string-join(fn:distinct-values(fn:doc($URI)//*/name()),"&#10;")

# Write the results of each process module to a single file
PROCESS-TASK=com.marklogic.developer.corb.ExportBatchToFileTask
EXPORT-FILE-NAME=element-names.txt

# After the batch processing is completed, sort and dedup the element names
POST-BATCH-TASK=com.marklogic.developer.corb.PostBatchUpdateFileTask
EXPORT-FILE-SORT=ascending|distinct

THREAD-COUNT=10
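
Before launching the job against every document, it may be worth checking the process module expression on a single document, for example in Query Console. The following is only a sketch: the in-memory book document and its element names are made up for illustration and stand in for fn:doc($URI).

```xquery
xquery version "1.0-ml";

(: Hypothetical in-memory sample standing in for fn:doc($URI) :)
let $doc := document {
  <book>
    <title>Example</title>
    <author><name>First</name></author>
    <author><name>Second</name></author>
  </book>
}
(: Same expression as the process module, applied to the sample :)
return fn:string-join(fn:distinct-values($doc//*/fn:name()), "&#10;")
```

For this sample the result should contain the four names book, title, author and name, each on its own line, with the repeated author and name elements collapsed. Names that repeat across documents are still written once per document, which is why the post-batch sort and dedup step is needed.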

Upvotes: 2
