Reputation: 105
I'm looking for a MarkLogic cts query that I can use as a source query in Data Hub to return the distinct values on a combination of JSON property paths.
For example, I have 200K+ docs with this structure:
{
  "name": "Bentley University",
  "unit": null,
  "type": "University or College",
  "location": {
    "state": {
      "code": "MA",
      "name": "Massachusetts"
    },
    "division": "New England",
    "region": "Northeast",
    "types": [
      "School-College",
      "University"
    ]
  }
}
I would like to have a cts query that returns the distinct name + location/state/code values.
I've tried using cts.jsonPropertyScope(), but that returns entire docs. I just want the distinct values returned.
Upvotes: 1
Views: 115
Reputation: 66783
If you had a range index on those two properties, you could do this very easily with cts.valueCoOccurrences().
From the documentation:
Returns value co-occurrences (that is, pairs of values, both of which appear in the same fragment) from the specified value lexicon(s). The values are returned as an ArrayNode with two children, each child containing one of the co-occurring values. You can use cts.frequency on each item returned to find how many times the pair occurs. Value lexicons are implemented using range indexes; consequently this function requires a range index for each input index reference. If an index or lexicon is not configured for any of the input references, an exception is thrown.
For example, with path range indexes on /name and /location/state/code, the query would look like this:
cts.valueCoOccurrences(cts.pathReference("/name"), cts.pathReference("/location/state/code"))
Without indexes, the brute-force method, which reads all of the documents, would look something like this:
const nameAndCode = new Set();
// Match every document that has both a "name" and a "code" property.
const query = cts.andQuery([
  cts.jsonPropertyScopeQuery("name", cts.trueQuery()),
  cts.jsonPropertyScopeQuery("code", cts.trueQuery())
]);
for (const doc of cts.search(query)) {
  const obj = doc.toObject();
  // A Set keeps only one copy of each "name,code" combination.
  nameAndCode.add(obj.name + "," + obj.location.state.code);
}
Array.from(nameAndCode);
But it might be slow, and you run the risk of hitting an Expanded Tree Cache error if all of the docs can't be read at once.
You could also do some sort of iterative search: sample some documents, add their values to the set, and then use the accumulated values to exclude docs in the next search, repeating until a search no longer returns any documents or you hit some limit on the number of searches.
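A minimal sketch of that iterative loop follows. To keep it runnable outside MarkLogic, the search itself is passed in as a callback: searchBatch, collectDistinctPairs, inMemorySearch, and the sample data are all hypothetical names made up for this illustration, not MarkLogic APIs. In a real module, the callback would presumably wrap cts.search() with a query that excludes the name/code pairs seen so far and return a limited number of documents as plain objects.

```javascript
// Sketch only: `searchBatch` is a hypothetical callback, not a MarkLogic API.
// It receives the [name, code] pairs seen so far plus a batch limit, and must
// return at most `limit` documents (as plain objects) that match none of them.
function collectDistinctPairs(searchBatch, batchSize = 100, maxSearches = 1000) {
  const seen = new Map(); // JSON key -> [name, code]
  for (let i = 0; i < maxSearches; i++) {
    const batch = searchBatch(Array.from(seen.values()), batchSize);
    if (batch.length === 0) break; // no unseen combinations remain
    for (const obj of batch) {
      const name = obj.name;
      const code =
        obj.location && obj.location.state ? obj.location.state.code : null;
      seen.set(JSON.stringify([name, code]), [name, code]);
    }
  }
  return Array.from(seen.values());
}

// In-memory stand-in for the database, used only to exercise the loop.
// The second institution is invented sample data.
const sampleDocs = [
  { name: "Bentley University", location: { state: { code: "MA" } } },
  { name: "Bentley University", location: { state: { code: "MA" } } },
  { name: "Example College", location: { state: { code: "NH" } } },
];

// Plays the role of "search, excluding everything already seen".
function inMemorySearch(seenPairs, limit) {
  const excluded = new Set(seenPairs.map((p) => JSON.stringify(p)));
  return sampleDocs
    .filter(
      (d) => !excluded.has(JSON.stringify([d.name, d.location.state.code]))
    )
    .slice(0, limit);
}

// A batch size of 1 forces several rounds so the exclusion step is exercised.
const distinctPairs = collectDistinctPairs(inMemorySearch, 1);
```

Each round only has to hold one small batch in memory, which is the point of the approach. The trade-off is that the exclusion list grows with every round, so the maxSearches cap matters when there are many distinct combinations.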
Upvotes: 1