Reputation: 105

MarkLogic cts query to return distinct values in Data Hub

I'm looking for a MarkLogic cts query that I can use as a source query in Data Hub to return the distinct values on a combination of json property paths.

For example, I have 200K+ docs with this structure:

{
    "name": "Bentley University",
    "unit": null,
    "type": "University or College",
    "location": {
        "state": {
            "code": "MA",
            "name": "Massachusetts"
        },
        "division": "New England",
        "region": "Northeast",
        "types": [
            "School-College",
            "University"
        ]
    }
}

I would like to have a cts query that returns the distinct name + location/state/code values.

I've tried using cts.jsonPropertyScopeQuery(), but that returns entire docs. I just want the distinct values returned.

Upvotes: 1

Answers (1)

Mads Hansen

Reputation: 66783

If you had a range index on each of those two properties, then you could do this very easily with cts.valueCoOccurrences().

Returns value co-occurrences (that is, pairs of values, both of which appear in the same fragment) from the specified value lexicon(s). The values are returned as an ArrayNode with two children, each child containing one of the co-occurring values. You can use cts.frequency on each item returned to find how many times the pair occurs. Value lexicons are implemented using range indexes; consequently this function requires a range index for each input index reference. If an index or lexicon is not configured for any of the input references, an exception is thrown.

For example, with path range indexes on /name and /location/state/code, the query would look like this:

cts.valueCoOccurrences(cts.pathReference("/name"), cts.pathReference("/location/state/code"))
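If you want the pairs in the same "name,code" string form as the brute-force approach, a small helper could flatten them. This is only a sketch: pairsToStrings is a hypothetical name, and it assumes each returned item is either a plain two-element array or a node exposing toObject(), which you should verify against your MarkLogic version.

```javascript
// Hypothetical helper: turn co-occurrence pairs into "name,code" strings.
// Assumption: each item is a two-element array, or a node with toObject()
// that converts to one.
function pairsToStrings(pairs) {
  const result = [];
  for (const pair of pairs) {
    const values = typeof pair.toObject === "function" ? pair.toObject() : pair;
    result.push(values[0] + "," + values[1]);
  }
  return result;
}

// Plain arrays stand in for the server results so this runs outside MarkLogic.
const samplePairs = [
  ["Bentley University", "MA"],
  ["Yale University", "CT"],
];
const pairStrings = pairsToStrings(samplePairs);
```

Inside MarkLogic you would pass the result of the cts.valueCoOccurrences() call above instead of samplePairs.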

Without indexes, the brute-force method, which reads all of the documents, would look something like this:

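// Collect distinct "name,code" strings from every doc that has both properties.
const nameAndCode = new Set();
const query = cts.andQuery([
  cts.jsonPropertyScopeQuery("name", cts.trueQuery()),
  cts.jsonPropertyScopeQuery("code", cts.trueQuery())
]);
for (const doc of cts.search(query)) {
  const obj = doc.toObject();
  nameAndCode.add(obj.name + "," + obj.location.state.code);
}
// Return the distinct combinations as an array.
Array.from(nameAndCode);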

But it might be slow, and you run the risk of an Expanded Tree Cache Full error if all of the docs can't be read at once.

You could also search iteratively: sample some documents, add their values to the set, then use the accumulated values to exclude matching docs from the next search, repeating until a search returns no documents or you reach a limit on the number of searches.
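A rough sketch of that iterative loop is below. The names collectDistinct, searchBatch, and searchInMemory are hypothetical, and the search is an in-memory stand-in so the loop can run outside MarkLogic. On the server, the callback could presumably be built from cts.search() limited with fn.subsequence(), with a cts.notQuery() over the pairs seen so far.

```javascript
// Hypothetical helper: repeatedly search, accumulating distinct "name,code" strings.
// searchBatch(seen) is an assumed callback returning one page of documents
// (as plain objects) whose name/code pair is not already in `seen`.
function collectDistinct(searchBatch, maxSearches) {
  const seen = new Set();
  for (let i = 0; i < maxSearches; i++) {
    const batch = searchBatch(seen);
    if (batch.length === 0) break; // no unseen combinations remain
    for (const obj of batch) {
      seen.add(obj.name + "," + obj.location.state.code);
    }
  }
  return Array.from(seen);
}

// In-memory stand-in for the database, for illustration only.
const docs = [
  { name: "Bentley University", location: { state: { code: "MA" } } },
  { name: "Bentley University", location: { state: { code: "MA" } } },
  { name: "Boston College", location: { state: { code: "MA" } } },
  { name: "Yale University", location: { state: { code: "CT" } } },
];
const pageSize = 2; // assumed number of docs sampled per search

// Returns up to pageSize docs whose pair has not been seen yet.
function searchInMemory(seen) {
  return docs
    .filter(d => !seen.has(d.name + "," + d.location.state.code))
    .slice(0, pageSize);
}

const distinct = collectDistinct(searchInMemory, 10);
```

Each pass reads only one page of documents, which keeps memory use bounded, at the cost of an exclusion query that grows with the number of distinct pairs found.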

Upvotes: 1
