Reputation: 169

Search query to find documents that have multiple elements

I have a few XML documents in MarkLogic with the following structure:

<abc:doc>
  <abc:doc-meta>
     <abc:meetings>
        <abc:meeting>
        </abc:meeting>
        <abc:meeting>
        </abc:meeting>
     </abc:meetings>
  </abc:doc-meta>
</abc:doc>

There can be more than one <abc:meeting> element under the <abc:meetings> element. I am trying to write a cts:search query that returns only the documents containing more than one <abc:meeting> element. Please advise.

Upvotes: 1

Answers (2)

rjrudin

Reputation: 2236

It boils down to how many "a few" is. If it's thousands or fewer, then what grtjn presents in the other answer, a cts:search plus an XPath expression, will work fine. If it's more, I'd add a count attribute to abc:meetings and then use a pre-commit trigger (e.g. on the collection of these documents) to ensure that the count attribute value is kept in sync. You'd need a range index on that attribute to be able to query for "documents that have a count of meetings of 2 or greater".
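As a rough sketch of the "keep the count in sync" step, something like the function below could be called from such a trigger module, or run as a one-off over existing documents. The namespace URI, the function name local:sync-meeting-count and the document URI are assumptions for illustration only; the trigger setup itself is not shown.

```xquery
xquery version "1.0-ml";

(: Hypothetical namespace URI; replace it with the one your documents use. :)
declare namespace abc = "http://example.com/abc";

(: Sets abc:meetings/@count to the current number of abc:meeting children.
   It writes only when the stored value is missing or stale, which also
   avoids needless updates if the trigger fires again on the same document. :)
declare function local:sync-meeting-count($uri as xs:string) as empty-sequence()
{
  for $meetings in fn:doc($uri)/abc:doc/abc:doc-meta/abc:meetings
  let $count := fn:count($meetings/abc:meeting)
  return
    if (fn:empty($meetings/@count)) then
      (: no attribute yet: add one to the abc:meetings element :)
      xdmp:node-insert-child($meetings, attribute count { $count })
    else if (fn:string($meetings/@count) ne fn:string($count)) then
      (: attribute exists but is out of date: replace it :)
      xdmp:node-replace($meetings/@count, attribute count { $count })
    else ()
};

(: Hypothetical document URI. :)
local:sync-meeting-count("/example/doc-1.xml")
```

Inside a trigger module you would pass the URI that MarkLogic hands to the trigger instead of a hard-coded one.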

Of course, if all you need to query on is whether there's more than one meeting, then just add a "multiple" attribute to abc:meetings with a value of "true". Then you don't need a range index - you can do a cts:element-attribute-value-query on abc:meetings and multiple="true".
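A minimal sketch of that query, assuming the documents already carry multiple="true" on abc:meetings whenever there are two or more meetings. The namespace URI is a placeholder, not the real one.

```xquery
xquery version "1.0-ml";

(: Hypothetical namespace URI; replace it with the one your documents use. :)
declare namespace abc = "http://example.com/abc";

(: Resolved from the universal index: matches documents in which an
   abc:meetings element has the attribute multiple="true". :)
cts:search(
  fn:collection(),
  cts:element-attribute-value-query(
    xs:QName("abc:meetings"),
    xs:QName("multiple"),
    "true"
  )
)
```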

Upvotes: 2

grtjn

Reputation: 20414

This is tricky. Ideally, you'd want to drive searches from indexes for best performance. Unfortunately, MarkLogic doesn't keep track of element counts in its universal index, and aggregating counts from a range index can be cumbersome.

The overall simplest solution would be to add a count attribute on abc:meetings, and then add a range index on that. It does mean you'd have to change your data, and you'd have to keep that attribute in sync with each change.
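With the attribute in place, the search could look roughly like this. The sketch assumes an element-attribute range index of scalar type int has been configured on abc:meetings/@count (the query raises an error without a matching index), and the namespace URI is a placeholder.

```xquery
xquery version "1.0-ml";

(: Hypothetical namespace URI; replace it with the one your documents use. :)
declare namespace abc = "http://example.com/abc";

(: Matches documents whose abc:meetings/@count is greater than 1.
   xs:int(1) is used so the value type lines up with the assumed int index. :)
cts:search(
  fn:collection(),
  cts:element-attribute-range-query(
    xs:QName("abc:meetings"),
    xs:QName("count"),
    ">",
    xs:int(1)
  )
)
```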

You could also just search on the presence of abc:meeting with cts:element-query(), and append an XPath predicate to count the number of elements afterwards. Something like:

cts:search(
  collection(),
  cts:element-query(xs:QName('abc:meeting'), cts:true-query())
)[count(.//abc:meeting) > 1]

If not many documents contain meetings, this might work fairly well for you, but it still requires pulling up all documents containing meetings, hence could be expensive.
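To get a feel for how expensive that would be on your data, you could first ask the indexes how many documents contain at least one meeting, without fetching any of them. This is only a sketch: the namespace URI is a placeholder, and the number is an unfiltered estimate of the candidates the predicate would have to inspect.

```xquery
xquery version "1.0-ml";

(: Hypothetical namespace URI; replace it with the one your documents use. :)
declare namespace abc = "http://example.com/abc";

(: Index-only count of documents with at least one abc:meeting element,
   i.e. the candidate set that the XPath predicate would need to load. :)
xdmp:estimate(
  cts:search(
    fn:collection(),
    cts:element-query(xs:QName("abc:meeting"), cts:true-query())
  )
)
```

If the estimate is small, the filtered search with the count predicate is probably good enough; if it is large, one of the index-driven options is likely the better route.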

I played with the thought of leveraging cts:near-query(), but that is driven by word positions, so it depends on the actual number of tokens inside a meeting. If that were always an exact number of tokens (unlikely, I'd guess), you could use the minimum-distance option on a double cts:element-query() wrapped in a cts:near-query(). Even without that guarantee, it might help optimize the previous option a little.

The most performant option I can think of right now involves adding a user-defined aggregate function (UDF). Unfortunately, that means compiling C++ code. I happen to have written such a UDF in the past that you should be able to use as-is after compilation and installation. For details see:

https://github.com/grtjn/doc-count-udf

and

http://docs.marklogic.com/guide/app-dev/aggregateUDFs

HTH!

Upvotes: 4
