Reputation: 59
I have a question about Neo4j fulltext search. I am currently working on a database with a lot of species names, and I came across some behaviour I am trying to avoid.
Consider a fresh Neo4j db with 3 nodes (link to sandbox).
CREATE (:Term {name: "(Arabidopsis thaliana x Arabidopsis arenosa) x Arabidopsis suecica"}),
(:Term {name: "Arabidopsis thaliana"}),
(:Term {name: "Arabidopsis thaliana x Arabidopsis arenosa"})
and one fulltext index
CREATE FULLTEXT INDEX TermName IF NOT EXISTS
FOR (n:Term)
ON EACH [n.name]
If I now run the following search:
CALL db.index.fulltext.queryNodes("TermName","Arabidopsis")
YIELD node, score
RETURN score, node.name
You will find the following:
So my query returns the complex "(Arabidopsis thaliana x Arabidopsis arenosa) x Arabidopsis suecica" first, because it contains the search term three times. Is there a way to tell the fulltext query to ignore duplicate words, so that "Arabidopsis thaliana" is scored highest because it contains the fewest characters that are not part of the search phrase?
Upvotes: 1
Views: 131
Reputation: 12684
I cannot find a configuration option for the fulltext query that ignores or filters duplicated words in the index.
However, here is an alternative. Create a new property (call it name_clean_idx) that is a copy of name with the duplicated words removed. The trick is to replace the first occurrence of each duplicated word with *, remove all remaining occurrences, and then put the first occurrence back in place of the *.
STEP1 strips every character other than letters and spaces (such as the parentheses). STEP2 then uses split() to turn the name into a list of words and the APOC function apoc.coll.duplicatesWithCount to find the duplicated words in that list.
STEP1:
CALL apoc.periodic.iterate(
"MATCH (n:Term) RETURN n",
"SET n.name_clean_idx = apoc.text.replace(n.name, '[^a-zA-Z ]', '')",
{batchSize:10000, parallel:true})
STEP2:
CALL apoc.periodic.iterate(
"MATCH (n:Term) RETURN n",
"WITH n, split(n.name_clean_idx, ' ') as coll
WITH n as term, apoc.coll.duplicatesWithCount(coll) as dupArr
UNWIND dupArr as dup
MATCH (term)
SET term.name_clean_idx = replace(replace(substring(term.name_clean_idx,0, apoc.text.indexOf(term.name_clean_idx, dup.item)+size(dup.item)), dup.item, '*') + apoc.text.replace(substring(term.name_clean_idx, apoc.text.indexOf(term.name_clean_idx, dup.item)+size(dup.item)), dup.item, ''), '*', dup.item)",
{batchSize:10000, parallel:true})
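To make the intent of STEP1 and STEP2 concrete, here is a minimal sketch in Python of the value name_clean_idx is meant to end up with: non-letter characters stripped, and each word kept only at its first occurrence. It is not part of the Cypher above, and the function name clean_name is made up for illustration. It could also serve as a preprocessing step if the names are cleaned before loading.

```python
import re


def clean_name(name: str) -> str:
    """Return `name` with non-letters stripped and repeated words removed."""
    # Same character class as STEP1: keep only ASCII letters and spaces.
    letters_only = re.sub(r"[^a-zA-Z ]", "", name)

    seen = set()
    kept = []
    for word in letters_only.split():
        # Keep a word only the first time it appears (case-sensitive,
        # like the string comparison in the Cypher version).
        if word not in seen:
            seen.add(word)
            kept.append(word)
    return " ".join(kept)


if __name__ == "__main__":
    for name in [
        "(Arabidopsis thaliana x Arabidopsis arenosa) x Arabidopsis suecica",
        "Arabidopsis thaliana",
        "Arabidopsis thaliana x Arabidopsis arenosa",
    ]:
        print(repr(clean_name(name)))
```

Two differences from the Cypher are worth keeping in mind. The sketch compares whole words, whereas replace() and apoc.text.replace match substrings, so a short duplicated word such as x could in principle also be removed from inside a longer word. The sketch also collapses the extra spaces that the Cypher leaves behind, which should not matter to the fulltext index since it tokenizes on whitespace anyway.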
Then create the fulltext index on this new property:
CREATE FULLTEXT INDEX TermName2
FOR (n:Term)
ON EACH [n.name_clean_idx]
The new query then runs against the "cleaned" name, so every word is counted at most once per node:
CALL db.index.fulltext.queryNodes("TermName2","Arabidopsis")
YIELD node, score
RETURN score, node.name
╒════════════════════╤════════════════════════════════════════════════════════════════════╕
│"score" │"node.name" │
╞════════════════════╪════════════════════════════════════════════════════════════════════╡
│0.07456067204475403 │"Arabidopsis thaliana" │
├────────────────────┼────────────────────────────────────────────────────────────────────┤
│0.05851973593235016 │"Arabidopsis thaliana x Arabidopsis arenosa" │
├────────────────────┼────────────────────────────────────────────────────────────────────┤
│0.052836157381534576│"(Arabidopsis thaliana x Arabidopsis arenosa) x Arabidopsis suecica"│
└────────────────────┴────────────────────────────────────────────────────────────────────┘
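Why removing duplicates flips the order can be sketched with a small BM25 calculation. This assumes the index scores with Lucene's BM25 similarity at its default parameters (k1 = 1.2, b = 0.75) and that the analyzer behaves roughly like "lowercase, split on non-letters"; both are assumptions about the setup, and tokenize, dedupe and bm25_scores are illustrative helpers, not Neo4j or Lucene API.

```python
import math
import re

K1, B = 1.2, 0.75  # assumed BM25 parameters (Lucene's documented defaults)


def tokenize(text):
    # Rough stand-in for the index analyzer: lowercase runs of letters.
    return re.findall(r"[a-z]+", text.lower())


def dedupe(text):
    # Keep each token only at its first occurrence.
    return " ".join(dict.fromkeys(tokenize(text)))


def bm25_scores(docs, term):
    """Score every document in `docs` for a single query term."""
    token_lists = [tokenize(d) for d in docs]
    n_docs = len(token_lists)
    avg_len = sum(len(t) for t in token_lists) / n_docs
    containing = sum(1 for t in token_lists if term in t)
    idf = math.log(1 + (n_docs - containing + 0.5) / (containing + 0.5))
    scores = []
    for tokens in token_lists:
        tf = tokens.count(term)
        # Longer documents get a larger penalty in the denominator.
        length_norm = K1 * (1 - B + B * len(tokens) / avg_len)
        scores.append(idf * tf / (tf + length_norm))
    return scores


names = [
    "(Arabidopsis thaliana x Arabidopsis arenosa) x Arabidopsis suecica",
    "Arabidopsis thaliana",
    "Arabidopsis thaliana x Arabidopsis arenosa",
]

raw = bm25_scores(names, "arabidopsis")
cleaned = bm25_scores([dedupe(n) for n in names], "arabidopsis")

if __name__ == "__main__":
    for label, scores in (("raw", raw), ("cleaned", cleaned)):
        print(label)
        for score, name in sorted(zip(scores, names), reverse=True):
            print(f"  {score:.5f}  {name}")
```

With the raw names, the triple occurrence of the term outweighs the length penalty, so the longest name ranks first. Once every name contains the term exactly once, only the length penalty differs and the shortest name wins, which is the order shown in the table above.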
Upvotes: 1