Reputation: 71

MarkLogic - getting distinct values

I have a database containing XML documents that look roughly like this:

<document>
  <question_item>
    <question>What is your name?</question>
  </question_item>
  <question_item>
    <question>What is your address?</question>
  </question_item>
...
</document>

I want to take a search term and return a distinct list of the questions that contain it. For example, searching for "name" against the data above would return one result: "What is your name?".

I have successfully implemented this with fn:distinct-values, but obviously this is not efficient.

I want to implement this with CTS. I have tried the following:

for $question in cts:element-values(
  xs:QName('question'),(),(), 
  cts:element-word-query(xs:QName("question"), "name"))
return $question

However, this returns questions that do not have "name" in the question text; with the example above, both questions come back. I think this is because the query is run unfiltered, so every question in a fragment is returned whenever anything in that fragment matches.

Is this assumption correct?

How can I achieve this efficiently?

Thanks!

Upvotes: 1

Answers (2)

grtjn

Reputation: 20414

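Have you considered saving each question_item as a separate document? Each fragment would then hold exactly one question, so a fragment-level match would be the same as a value-level match. That way you would not need filtering, and you could run your code unchanged.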
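
In case it helps, here is a rough sketch of how the split could be done. The "question-items" collection name and the URI scheme are placeholders I made up for illustration, and for a large database you would probably want to batch this rather than run it in a single transaction:

```xquery
xquery version "1.0-ml";

(: Sketch only: copy every question_item into its own document.
   The URI scheme and the "question-items" collection are hypothetical. :)
for $doc in fn:doc()[document]
let $base := xdmp:node-uri($doc)
for $item at $i in $doc/document/question_item
return
  xdmp:document-insert(
    fn:concat($base, "/question-item-", $i, ".xml"),
    $item,
    xdmp:default-permissions(),
    "question-items")
```

The original documents still contain every question, so the lexicon call would need to be restricted to the new documents (or the originals removed), for example by and-ing the word query with cts:collection-query("question-items").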

HTH!

Upvotes: 2

wst

Reputation: 11771

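That's correct; cts:element-values() is a lexicon-type function, so it runs unfiltered. The query argument only selects matching fragments, and every value in a matching fragment is returned.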

The most efficient way to do this is probably to use a matching lexicon function like cts:element-value-match:

cts:element-value-match(xs:QName('question'), "* name*")

The catch is that this matches directly against the range index, which is the fastest approach but lacks some of the features of cts:search-based queries, such as linguistic stemming. So, for example, to handle all the cases where you might want to match "name", you might have to use a more elaborate set of patterns:

fn:distinct-values(for $p in ("* name?", "name *", "* name *")
  return cts:element-value-match(xs:QName('question'), $p))

If the limitations of wildcards don't present any problems to your application, then this is the most efficient way to query those values, given the structure of your documents.

One compromise solution that still uses cts:queries and may be fast enough for your purposes is to filter the values after querying them:

for $v in cts:element-values(xs:QName('question'), (), (),
  cts:element-word-query(xs:QName('question'), 'name'))
where cts:contains($v, cts:word-query('name'))
return $v

Upvotes: 5
