Kevin Frey

Reputation: 59

Neo4j fulltext search. Don't score same word multiple times

I have a question about the Neo4j fulltext search. I am currently working on a database with a lot of species names, and I came across some behaviour I am trying to avoid.

Consider a fresh neo4j db with 3 nodes (link to sandbox).

CREATE (:Term {name: "(Arabidopsis thaliana x Arabidopsis arenosa) x Arabidopsis suecica"}),
(:Term {name: "Arabidopsis thaliana"}),
(:Term {name: "Arabidopsis thaliana x Arabidopsis arenosa"})

and one fulltext index

CREATE FULLTEXT INDEX TermName IF NOT EXISTS
FOR (n:Term)
ON EACH [n.name]

If I now run the following search:

CALL db.index.fulltext.queryNodes("TermName","Arabidopsis")
YIELD node, score
RETURN score, node.name

The result ranks the complex name "(Arabidopsis thaliana x Arabidopsis arenosa) x Arabidopsis suecica" first, because it contains the search term three times. Is there a way to tell the fulltext search to ignore duplicate words, so that "Arabidopsis thaliana" is scored highest because it contains the fewest characters that are not part of the search phrase?

Upvotes: 1

Views: 131

Answers (1)

jose_bacoy

Reputation: 12684

I could not find a configuration option for the fulltext query that ignores or filters duplicated words in the index.

However, I can suggest an alternative. Create a new property (call it name_clean_idx) that is a copy of name with the repeated word(s) removed. The trick is to replace the first occurrence of each duplicated word with *, remove all remaining occurrences of that word, and then put the word back in place of the *.

The split() function turns the name into a list of words, and the APOC function apoc.coll.duplicatesWithCount finds the words in that list that occur more than once.
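To make the intended transformation concrete, here is a minimal sketch in plain Python (not Cypher, and not part of the solution itself). The function name clean_name is made up for illustration. It assumes words are separated by spaces and that only letters and spaces should survive, mirroring the '[^a-zA-Z ]' pattern used in STEP1 below. Unlike the Cypher statements, it works on whole words and collapses extra spaces, so the stored string can differ in whitespace.

```python
import re

def clean_name(name: str) -> str:
    """Strip non-letter characters, then keep only the first occurrence of each word."""
    # Same character filter as the regex in STEP1: letters and spaces only.
    letters_only = re.sub(r"[^a-zA-Z ]", "", name)
    seen = set()
    kept = []
    for word in letters_only.split():
        if word not in seen:
            seen.add(word)
            kept.append(word)
    return " ".join(kept)

print(clean_name("(Arabidopsis thaliana x Arabidopsis arenosa) x Arabidopsis suecica"))
# prints: Arabidopsis thaliana x arenosa suecica
```

Note that the repeated "x" is dropped as well, since it is just another duplicated word.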

STEP1 (copy name into name_clean_idx, keeping only letters and spaces):

 CALL apoc.periodic.iterate(
   "MATCH (n:Term) RETURN n",
   "SET n.name_clean_idx = apoc.text.replace(n.name, '[^a-zA-Z ]', '')", 
   {batchSize:10000, parallel:true})

STEP2 (for each duplicated word, keep the first occurrence and remove the rest):

CALL apoc.periodic.iterate(
  "MATCH (n:Term) RETURN n",
  "WITH n, split(n.name_clean_idx, ' ') as coll
  WITH n as term, apoc.coll.duplicatesWithCount(coll) as dupArr  
  UNWIND dupArr as dup
  MATCH (term)
     SET term.name_clean_idx = replace(replace(substring(term.name_clean_idx,0, apoc.text.indexOf(term.name_clean_idx, dup.item)+size(dup.item)), dup.item, '*') + apoc.text.replace(substring(term.name_clean_idx, apoc.text.indexOf(term.name_clean_idx, dup.item)+size(dup.item)), dup.item, ''), '*', dup.item)",
   {batchSize:10000, parallel:true})

Then create the fulltext index on this new property:

CREATE FULLTEXT INDEX TermName2  
FOR (n:Term)
ON EACH [n.name_clean_idx]

The new query then searches the index on the cleaned property (name_clean_idx) while still returning the original name:

CALL db.index.fulltext.queryNodes("TermName2","Arabidopsis")
YIELD node, score
RETURN score, node.name

Result:

╒════════════════════╤════════════════════════════════════════════════════════════════════╕
│"score"             │"node.name"                                                         │
╞════════════════════╪════════════════════════════════════════════════════════════════════╡
│0.07456067204475403 │"Arabidopsis thaliana"                                              │
├────────────────────┼────────────────────────────────────────────────────────────────────┤
│0.05851973593235016 │"Arabidopsis thaliana x Arabidopsis arenosa"                        │
├────────────────────┼────────────────────────────────────────────────────────────────────┤
│0.052836157381534576│"(Arabidopsis thaliana x Arabidopsis arenosa) x Arabidopsis suecica"│
└────────────────────┴────────────────────────────────────────────────────────────────────┘
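The new ordering is what you would expect if the index scores with BM25, where a term that occurs once counts for more in a shorter field. The sketch below is plain Python and rests on several assumptions that are mine, not something I verified in the Neo4j configuration: Lucene-style BM25 with k1 = 1.2 and b = 0.75, an analyzer that keeps "x" as a token, and token counts of 2, 4 and 5 for the cleaned names (2, 5 and 8 for the original names). The labels and function names are made up for illustration.

```python
import math

def bm25(tf, doc_len, avg_len, n_docs, n_match, k1=1.2, b=0.75):
    """BM25 score of one term in one document (form without the (k1 + 1) factor)."""
    idf = math.log(1 + (n_docs - n_match + 0.5) / (n_match + 0.5))
    norm = k1 * (1 - b + b * doc_len / avg_len)
    return idf * tf / (tf + norm)

def rank(docs):
    """docs maps a label to (term frequency, token count); returns (label, score), best first."""
    n_docs = len(docs)
    avg_len = sum(length for _, length in docs.values()) / n_docs
    n_match = sum(1 for tf, _ in docs.values() if tf > 0)
    scored = [(label, bm25(tf, length, avg_len, n_docs, n_match))
              for label, (tf, length) in docs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Assumed (term frequency of "Arabidopsis", token count) per name.
original = {"thaliana": (1, 2), "thaliana x arenosa": (2, 5), "complex hybrid": (3, 8)}
cleaned = {"thaliana": (1, 2), "thaliana x arenosa": (1, 4), "complex hybrid": (1, 5)}

for label, score in rank(original):
    print("original", label, round(score, 4))
for label, score in rank(cleaned):
    print("cleaned ", label, round(score, 4))
```

Under these assumptions the cleaned scores come out at roughly 0.0746, 0.0585 and 0.0528, which agrees with the table above, and the original data ranks the complex hybrid first, as described in the question.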

Upvotes: 1
