Joe Glorioso

Reputation: 123

How to generate keywords for documents stored in MarkLogic?

I need to generate a list of keywords for each document in a set of documents that are loaded into MarkLogic. I am considering running cts:distinctive-terms against the set of documents, but cannot figure out how to get a list of keywords for each document rather than a list of terms relevant to the set. Can anyone suggest a solution?

Upvotes: 2

Views: 298

Answers (2)

mblakele

Reputation: 7842

Were you using the score=logtf option? When I tried that, the scores of stop-words went up quite a bit. If you think about it this makes sense: the database can no longer use IDF to weed them out. If you only want TF, though, you could filter using a stop-word list - as already suggested.

But logtfidf scoring should naturally penalize stop-words. You can set the min-val option or other options to tune the results. For example, here I set min-val to 27 because stop-words began to appear at 26. The right options will depend on the existing database content, because of IDF.

cts:distinctive-terms(
  text { 'I need to generate a list of keywords for each document in a set of documents that are loaded into MarkLogic. I am considering running cts:distinctive-terms against the set of documents, but cannot figure out how to get a list of keywords for each document rather than a list of terms relevant to the set. Can anyone suggest a solution?' },
  <options xmlns="cts:distinctive-terms"
   xmlns:db="http://marklogic.com/xdmp/database">
    <min-val>27</min-val>
    <use-db-config>false</use-db-config>
    <db:stemmed-searches>true</db:stemmed-searches>
    <db:word-searches>false</db:word-searches>
    <db:fast-phrase-searches>false</db:fast-phrase-searches>
  </options>)/cts:term/cts:word-query/cts:text/string()
=>
load
set
solution
term
document
list
keyword
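
If TF-only scoring is what you want, the stop-word filter mentioned above could be sketched like this. The stop-word list and the document URI "/example.xml" are placeholders I made up for illustration, and the path only picks up plain word-query terms, so treat it as a starting point rather than a finished solution:

```xquery
xquery version "1.0-ml";

(: Hypothetical stop-word list; replace it with one suited to your content. :)
let $stop-words := ("a", "an", "and", "for", "in", "of", "the", "to")
(: "/example.xml" is a placeholder URI, not a document from the question. :)
let $terms :=
  cts:distinctive-terms(
    fn:doc("/example.xml"),
    <options xmlns="cts:distinctive-terms">
      <score>logtf</score>
    </options>)
(: Keep only the word-query terms that are not in the stop-word list. :)
for $word in $terms/cts:term/cts:word-query/cts:text/fn:string()
where fn:not(fn:lower-case($word) = $stop-words)
return $word
```

With stemmed searches enabled the returned terms are stems, so the stop-word list should contain stems as well.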

Upvotes: 3

grtjn

Reputation: 20414

Simply iterate over the docs of interest and call cts:distinctive-terms for each doc separately:

for $doc in doc()
return
    cts:distinctive-terms($doc)
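
To keep track of which keywords belong to which document, something along these lines might work. The keywords and keyword wrapper elements and the max-terms value of 10 are my own choices for illustration, and the path assumes the terms come back as plain word queries, which depends on your index settings:

```xquery
xquery version "1.0-ml";

(: Sketch: pair each document URI with its own distinctive terms.
   The <keywords>/<keyword> wrapper elements are illustrative only. :)
for $doc in fn:doc()
let $terms :=
  cts:distinctive-terms(
    $doc,
    <options xmlns="cts:distinctive-terms">
      <max-terms>10</max-terms>
    </options>)
return
  <keywords uri="{xdmp:node-uri($doc)}">{
    (: Extract the text of each word-query term as one keyword. :)
    for $text in $terms/cts:term/cts:word-query/cts:text
    return <keyword>{fn:string($text)}</keyword>
  }</keywords>
```

On a large database you would probably want to replace fn:doc() with a narrower selection, for example a collection or directory of interest.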

HTH!

Upvotes: 3
