Reputation: 169
I have a few XML documents in marklogic which have the structure
<abc:doc>
<abc:doc-meta>
<abc:meetings>
<abc:meeting>
</abc:meeting>
<abc:meeting>
</abc:meeting>
</abc:meetings>
</abc:doc-meta>
</abc:doc>
We can have more than one <abc:meeting>
element under the <abc:meetings>
element.
I am trying to write a cts:search
query to get only documents that have more than one <abc:meeting>
element in the document.
Please advise
Upvotes: 1
Views: 744
Reputation: 2236
It boils down to how many "a few" is. If it's thousands or fewer, than what grtjn presents above for a cts:search
plus an XPath expression will work fine. If it's more, I'd add the count attribute to abc:meetings
and then use a pre-commit trigger (e.g. on the collection of these documents) to ensure that the count attribute value is kept in sync. You'd need a range index to be able to query for "Documents that have a count of meetings of 2 or greater".
Of course, if all you need to query on is whether there's more than one meeting, then just add a "multiple" attribute to abc:meetings
with a value of "true". Then you don't need a range index - you can do a cts:element-attribute-value-query
on abc:meetings
and multiple="true".
Upvotes: 2
Reputation: 20414
This is tricky. Ideally, you'd want to drive searches from indexes for best performance. Unfortunately, MarkLogic doesn't keep track of element counts in its universal index, and aggregating counts from a range index can be cumbersome.
The overall simplest solution would be to add a count attribute on abc:meetings
, and then add a range index on that. It does mean you'd have to change your data, and you'd have to keep that attribute in synch with each change.
You could also just search on the presence of abc:meeting
with cts:element-query()
, and append an XPath predicate to count the number of elements afterwards. Something like:
cts:search(
collection(),
cts:element-query(xs:QName('abc:meeting'), cts:true-query())
)[count(.//abc:meeting) > 1]
If not many documents contain meetings, this might work fairly well for you, but it still requires pulling up all documents containing meetings, hence could be expensive.
I played with the thought of leveraging cts:near-query()
, but that is driven on word positions, so depends on the actual amount of tokens inside a meeting. If that were always an exact number of tokens (unlikely I'd guess), you could use the minimal-distance
option on a double cts:element-query()
wrapped in a cts:near-query()
. It might help optimize the previous option a little though.
Most performant option I can think of right now, involves adding a User-Defined aggregate Function. It unfortunately means compiling c++ code. I happen to have written such a UDF in the past, that you should be able to use as-is after compilation and installation. For details see:
https://github.com/grtjn/doc-count-udf
and
http://docs.marklogic.com/guide/app-dev/aggregateUDFs
HTH!
Upvotes: 4