Reputation: 57
I have the following situation in MarkLogic. I have a database with a large number of documents, each holding hundreds of fields. Now let's say that for one of those fields I want to retrieve all the values from a large set of documents (using cts:search). What are my options?
Obviously, I could use a range index here. However, when I am exploring my dataset this would mean that I have to apply range indexes to all my fields, which seems like overkill. In addition, I am perfectly fine with each query taking a little while.
So I have tried to run a simple cts:search query: cts:search(//Cost, cts:collection-query("myCollection")). This query returns the value (and element) of the element I am interested in, and it works just fine when I am looking at fewer than 10k documents. However, when I am exploring a set of 1m documents I get an XDMP-EXPNTREECACHEFULL error, which suggests that MarkLogic is actually opening all the documents before retrieving the XML node and returning it to my Query Console.
Is there any way to change this query so that it at least returns a result?
I have tried running unfiltered searches and using xdmp:eval
to isolate the transaction, but so far without success.
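For reference, the unfiltered variant I mean is roughly this (a sketch):
cts:search(//Cost, cts:collection-query("myCollection"), "unfiltered")
As far as I understand, "unfiltered" only skips the result-filtering step; the documents still have to be expanded to return the Cost nodes, so the cache fills up either way.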
Upvotes: 1
Views: 539
Reputation: 3732
This sounds like a classic 'Data Warehouse BI query', in which case the concepts of traditional data warehouse databases can be applied. BI queries of this sort are exemplified by 'columnar queries', i.e. aggregates over all 'rows' sliced by one or a few 'columns' (the classic sum/min/max/avg of X across all rows). Hence were born the column-store database and the modern data lake variants (Presto, Athena, and in some ways Hadoop). The MarkLogic way is foremost the use of range indexes. You mentioned you'd have to apply an index to "all my fields", yet you also say the value to aggregate is one field. This is where one can mistakenly conflate the uses of indexes. I believe your case is such that A) the query for candidate documents is complex (possibly thousands of fields), but B) the values needed for the computation come from one field (or a few).
You can solve this by putting a targeted range index (or similar, including a semantic index) on just the field in B), and then using the universal index, plus if needed a few select index enhancements, for purpose A). The result may be resolvable entirely (or largely) from indexes.
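For example (a sketch, assuming the field in B) is the Cost element from the question and that an element range index has been configured on it), the values can then be pulled straight from the index without opening any documents:
cts:element-values(xs:QName("Cost"), (), (), cts:collection-query("myCollection"))
cts:element-values is resolved from the range index, so the expanded tree cache is never touched; it returns distinct values, and frequencies are available via cts:frequency if you need counts.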
"Data Wherehouse Trick 2" Create a temporary DB and forest (ideally on a different storage). Preprocess the source data and filter it down, likely re-normalized to a bare minimal representation with as few seperate element/attribute types as possible. Query from that Given enough ram this can be done entirely in-memory (using maps or arrays)
Compromise: use multiple passes. In the first pass, identify the document IDs that contain the values of interest via a query that is ideally fully index-resolvable; if not, break it into batches as above. The document IDs will likely fit in a session- or server-scope variable. Then query for only the necessary value with a precise cts:search or XPath expression that retrieves just that one value as a sequence. Having the document IDs up front means the query plan does not need to open 'possible documents', only those given. The IDs are sufficient to resolve AND and OR queries (document inclusion) against any other index directly.
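A sketch of the two passes (the batch size is a placeholder, and each batch would be run as its own request so that no single transaction has to expand everything):

(: pass 1: index-resolved, returns only URIs :)
let $uris := cts:uris((), (),
  cts:and-query((
    cts:collection-query("myCollection"),
    cts:element-query(xs:QName("Cost"), cts:true-query())
  )))
(: pass 2: for one batch of those URIs, fetch only the Cost values :)
let $batch := $uris[1 to 5000]
return
  cts:search(//Cost, cts:document-query($batch), "unfiltered") ! fn:string(.)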
Upvotes: 0
Reputation: 66781
One option would be to execute a CORB job, which will fan out the work into individual module executions with multiple threads.
That would avoid the chance of expanded tree cache errors, since each document is processed in a single module execution and the results are then collected in a file. You can also enable the DISK-QUEUE option, which allows you to process an extremely large set of URIs without having to fit them all in memory in the JVM.
In the URIS-MODULE, the CORB job selects the URIs of all documents that are in the collection and have the element that you want to report on. The PROCESS-MODULE then returns the values from the Cost elements of each of those documents. Configure the job to write the results of the process module to a single file using the PROCESS-TASK, and then use a POST-BATCH-TASK to sort and dedup the values.
Below is an example CORB options file to configure a job to achieve what you want. You will need to adjust the XCC-CONNECTION-URI to connect to your database.
# how to connect to the database
XCC-CONNECTION-URI=xcc://myUsername:myPassword@localhost:8200
# An inline XQuery module to find the URIs of docs that have a Cost element and are in the myCollection
URIS-MODULE=INLINE-XQUERY|declare variable $URIS := cts:uris("", (), cts:and-query(( cts:collection-query("myCollection"), cts:element-query(xs:QName("Cost"), cts:true-query()) )) ); count($URIS), $URIS
# for every URI, extract and return the value of the Cost element(s)
PROCESS-MODULE=INLINE-XQUERY|declare variable $URI as xs:string external; fn:doc($URI)//Cost/string()
# write the results of the process module execution to a file
PROCESS-TASK=com.marklogic.developer.corb.ExportBatchToFileTask
# the name of the output file to write (use full path, or also use EXPORT-FILE-DIR)
EXPORT-FILE-NAME=cost-values.txt
# after batch processing is complete, modify the output file with a post-batch-task
POST-BATCH-TASK=com.marklogic.developer.corb.PostBatchUpdateFileTask
# sort and dedup the values in the export file
EXPORT-FILE-SORT=ascending|distinct
# how many threads you want processing the docs
THREAD-COUNT=10
DISK-QUEUE=true
# A location to store temporary files required for the disk queue, sorting and dedup, etc. (default is Java temp dir)
#TEMP-DIR=/temp
# Override the TEMP-DIR location with a different path to be used for the DISK-QUEUE
#DISK-QUEUE-TEMP-DIR=
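To run the job (a sketch; the exact jar names and versions depend on your installation), point the CORB Manager class at the options file with a system property:
java -server -cp marklogic-xcc.jar:marklogic-corb.jar -DOPTIONS-FILE=myjob.properties com.marklogic.developer.corb.Manager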
Upvotes: 0
Reputation: 61
Generally, very large queries are run to compute averages or trends, for which a sample is adequate and appropriate. You can use cts:search with the "score-random" option to do a random sample. This causes the ordering of the "search" (that's search-engine terminology, but of course it is really a query) to be random, so the first 10,000 items will be a random sample subject to your query condition.
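In XQuery (the syntax used in the question), a sketch of that sampling approach, taking the first 10,000 Cost values from the randomly ordered results:

fn:subsequence(
  cts:search(//Cost,
    cts:collection-query("myCollection"),
    ("unfiltered", "score-random")),
  1, 10000) ! fn:string(.)

Because cts:search results are evaluated lazily, only the sampled documents should need to be expanded.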
But if you need to process the entire database or data set, taskbot, CORB, range indexing, or materializing the data on ingest are all options.
Upvotes: 1
Reputation: 2236
I believe Taskbot - https://github.com/mblakele/taskbot - will help avoid filling the expanded tree cache, since it splits the work into a user-defined number of transactions. You are correct that MarkLogic needs to load each document to get the values in the absence of a range index. Taskbot will at least ensure you get results back by avoiding, for example, loading a million documents in a single transaction.
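Taskbot's actual API differs, but the underlying idea is roughly this hand-rolled sketch (the batch size is a placeholder): split the URIs into chunks and run each chunk on the task server as its own transaction, so no single transaction expands more than one batch of documents:

let $uris := cts:uris((), (), cts:collection-query("myCollection"))
let $size := 5000
for $i in 1 to xs:integer(fn:ceiling(fn:count($uris) div $size))
let $chunk := fn:subsequence($uris, ($i - 1) * $size + 1, $size)
return
  (: result=true makes the calling request wait for and return each task's output :)
  xdmp:spawn-function(
    function() { fn:doc($chunk)//Cost ! fn:string(.) },
    <options xmlns="xdmp:eval"><result>true</result></options>
  )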
Upvotes: 1