Robert

Reputation: 71

MarkLogic - Improving query

As a follow-up to the question "MarkLogic - getting distinct values":

I have a document structure like this:

<document>
<question_item>
  <question>What is your name?</question>
  <answer>Barney Rubble</answer>
</question_item>
<question_item>
  <question>What is your address?</question>
  <answer>Bedrock</answer>
</question_item>
...
</document>

Thanks to the answer I received on the other question, I can now list all distinct questions in order of frequency like this:

 for $v in cts:element-values(xs:QName('question'), (), (),
    cts:element-word-query(xs:QName('question'), 'name'))
 where cts:contains($v, cts:word-query('name'))
 order by cts:frequency($v)
 return concat($v, "-", cts:frequency($v))

I would love to be able to also include, with each distinct question, the top x most common answers and their counts, e.g. Barney Rubble (100), Fred Flintstone (59), etc.

Is there a way to do this reasonably efficiently? I know another option is to change the document format to have one document per question_item but I would prefer to avoid that for the moment, if possible.

Any help is greatly appreciated. Thanks!

Upvotes: 1

Views: 80

Answers (2)

wst

Reputation: 11773

Just to elaborate on David Ennis' answer, ultimately, if you want to be able to do in-index "joins" where you can query answers based on the results of another query, it will be simpler to have only one question + answer per document/fragment.

Using your current schema and given a set of question values, if you index answers, the level of filtering you get from the index stops at the granularity of your document/fragment. So you would only be able to list "every answer in any document containing a question equal to a supplied value." Then, since your next filtering step is based on the question the answer belonged to, you're stuck, because you only have answer strings with no context.

Without fragment roots or remodeling the XML, the solution would involve 1) getting every document matching the question list, 2) filtering the false positive questions out of the document, and 3) counting the answers. If you expect step 1 to return only a small number of documents, then the performance would probably be fine. Otherwise, you could see IO thrashing, where the data isn't cached and has to be retrieved from disk.
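As a rough illustration of those three steps, here is a minimal XQuery sketch. It assumes the documents look like the sample in the question; the variable names ($question, $top) and the example values are illustrative only, and it has not been tuned for large result sets.

```xquery
xquery version "1.0-ml";

(: Hypothetical inputs: one question string and how many answers to show. :)
let $question := "What is your name?"
let $top := 5

(: Step 1: fetch every document that contains the question value. :)
let $docs := cts:search(fn:doc(),
  cts:element-value-query(xs:QName("question"), $question, "exact"))

(: Step 2: keep only the answers whose own question_item matches,
   discarding the other question_items in the same documents. :)
let $answers :=
  $docs//question_item[question = $question]/answer/fn:string(.)

(: Step 3: count the answers in memory. :)
let $counts := map:map()
let $_ :=
  for $a in $answers
  return map:put($counts, $a, (map:get($counts, $a), 0)[1] + 1)

return
  fn:subsequence(
    for $a in map:keys($counts)
    let $n := map:get($counts, $a)
    order by $n descending
    return fn:concat($a, " (", $n, ")"),
    1, $top)
```

Steps 2 and 3 run against the fetched documents rather than the indexes, which is why the cost grows with the number of documents returned by step 1.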

Upvotes: 2

The key is to harness a single range index for this type of faceting. However, the current model requires filtering, which makes it inefficient for your goal.

Below are some options for you to explore. If I could not change my model, then I would probably go with Option 1.1.

Option 1)
To answer that question, I would have considered modelling the data differently: all answers to a distinct question in a single document per question. Then, with a single range index, I could get the result by treating this as a facet, giving the result you want straight from the in-memory index.

Option 1.1)
You could also keep your data as-is and still create these facet-friendly tables after a survey is created. De-normalized data is not bad when it serves a purpose like reducing code complexity or increasing performance.

Option 2)
Leave the data as-is and add a fragment root on the question_item element. This allows MarkLogic to treat each of those smaller XML fragments separately. If you add a range index on the question and another on the answer, then you are able to iterate over all questions quickly and also get the faceted answers you want. This will cause the number of small fragments in the system to explode (one per answer). This is a viable solution, but one that I would avoid if I could re-model the data instead.
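With that configuration in place, the per-question facet could be sketched roughly as follows. This assumes a fragment root on question_item (the element name used in the sample document) plus string range indexes on question and answer using the default collation; the limit of 5 is arbitrary.

```xquery
xquery version "1.0-ml";

(: Outer loop: every distinct question, most frequent first. :)
for $q in cts:element-values(xs:QName("question"), (),
            ("frequency-order", "descending"))
return (
  fn:concat($q, " - ", cts:frequency($q)),
  (: Inner lookup: answer values restricted to fragments whose
     question equals $q, again most frequent first. :)
  for $a in cts:element-values(xs:QName("answer"), (),
              ("frequency-order", "descending", "limit=5"),
              cts:element-range-query(xs:QName("question"), "=", $q))
  return fn:concat("    ", $a, " (", cts:frequency($a), ")")
)
```

Because each fragment then holds exactly one question and one answer, the range query on question narrows the answer lexicon to the right fragments without any filtering.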

Option 3)
If you have a finite number of questions, then you could also add an attribute that identifies each unique question:

   <question_item question-key="q1">
     <question>What is your address?</question>
     <answer>Bedrock</answer>
   </question_item>

Then for each question, add a path range index on paths such as:

//question_item[@question-key="q1"]/answer
//question_item[@question-key="q2"]/answer
...
//question_item[@question-key="qn"]/answer

Then you can apply the facet approach per answer. But as you can see, if the number of questions is more than a handful, this becomes cumbersome, bordering on silly.
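For a single question key, the lookup might look like the sketch below. It assumes a string path range index has already been configured on exactly this path (written with question_item, matching the sample XML). Whether the predicate form is accepted depends on the path-index rules of your MarkLogic version, so treat it as untested.

```xquery
xquery version "1.0-ml";

(: Top 5 answers for the question keyed "q1", read from the
   path range index assumed to exist on this exact path. :)
for $a in cts:values(
            cts:path-reference('//question_item[@question-key="q1"]/answer'),
            (),
            ("frequency-order", "descending", "limit=5"))
return fn:concat($a, " (", cts:frequency($a), ")")
```

The frequency here is counted per fragment, so it equals the number of documents giving that answer as long as each document answers the question once.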

Upvotes: 4
