I am building a positional inverted index from text. The index structure is shown below (suggestions for improvement are welcome):
{
term: {
documentID: {pageno:[positions], pageno:[positions]},
documentID: {pageno:[positions]}
}
}
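For context, here is a minimal sketch of how an index of this shape might be built. The function name `buildIndex`, the input shape `{ id, pages: [text, ...] }`, the lowercase alphanumeric tokenizer, and the 1-based positions per page are all assumptions made for illustration, not part of my actual code.

```javascript
// Hypothetical helper: builds { term: { documentID: { pageno: [positions] } } }.
// Assumes each document looks like { id: "d1", pages: ["text of page 1", ...] }.
function buildIndex(documents) {
  // Null-prototype object so terms such as "constructor" do not collide
  // with properties inherited from Object.prototype.
  const index = Object.create(null);
  for (const doc of documents) {
    doc.pages.forEach((text, pageIdx) => {
      const pageNo = pageIdx + 1; // pages are numbered from 1
      const tokens = text.toLowerCase().match(/[a-z0-9]+/g) || [];
      tokens.forEach((term, i) => {
        if (!index[term]) index[term] = {};
        if (!index[term][doc.id]) index[term][doc.id] = {};
        if (!index[term][doc.id][pageNo]) index[term][doc.id][pageNo] = [];
        index[term][doc.id][pageNo].push(i + 1); // positions are numbered from 1
      });
    });
  }
  return index;
}
```

Positions restart on every page in this sketch, so a proximity match can never span a page break.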
I want to implement proximity search on top of this index.
Queries of the form X AND Y /3 or X Y /2 are called proximity queries. The expression means: retrieve documents that contain both X and Y, within 3 words or 2 words of each other respectively.
Reference - Boolean Retrieval Model Using Inverted Index and Positional Index
I want to implement this in NodeJS for two or more words, but I am unsure how to go about it.
I thought of creating a search result object for each word from the index. This would have a structure like:
word :{document1: {page1:[positions], page2:[positions]}}
and then somehow compare the positions on every intersecting page for all the words and calculate the proximity.
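The lookup-and-intersect step described above could be sketched roughly as follows. `pagesWithAllWords` is a hypothetical name, and the sketch assumes the index has the shape `{ term: { documentID: { pageno: [positions] } } }` and that the query words are distinct.

```javascript
// Hypothetical helper: keeps only the pages on which every query word occurs.
// Returns { documentID: { pageno: [[positions of word 1], [positions of word 2], ...] } }.
function pagesWithAllWords(index, words) {
  // One postings object per query word; a missing word yields an empty object.
  const postings = words.map((w) => index[w] || {});
  const result = {};
  // Only documents and pages listed under the first word can contain all words.
  const first = postings[0] || {};
  for (const docId of Object.keys(first)) {
    for (const pageNo of Object.keys(first[docId])) {
      const lists = postings.map((p) => (p[docId] || {})[pageNo]);
      if (lists.every((positions) => Array.isArray(positions))) {
        if (!result[docId]) result[docId] = {};
        result[docId][pageNo] = lists;
      }
    }
  }
  return result;
}
```

Starting from the first word's postings keeps the loop simple; starting from the rarest word instead would presumably reduce the work on large indexes.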
For the search query "nodejs hello world", the proximity in the string "hello extra words world more extra words nodejs" would be 5: count the extra words between the query words and sum them (2 between hello and world, plus 3 between world and nodejs), regardless of the order of the search words. See: Lucene Proximity Search for phrase with more than two words
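To make that definition concrete, here is a rough sketch that finds the tightest window containing one occurrence of every query word and reports how many other positions fall inside it. `minExtraWords` is a hypothetical name; the sketch assumes distinct query words and that `lists[i]` holds the positions of query word `i` on a single page.

```javascript
// Hypothetical helper: smallest number of positions, other than the chosen
// occurrences themselves, inside any window that contains one occurrence of
// every query word (in any order). Returns Infinity if no such window exists.
function minExtraWords(lists) {
  // Flatten to { pos, word } pairs and sort by position.
  const events = lists
    .flatMap((positions, word) => positions.map((pos) => ({ pos, word })))
    .sort((a, b) => a.pos - b.pos);
  const counts = new Array(lists.length).fill(0); // occurrences per word in the window
  let covered = 0; // number of distinct query words currently in the window
  let best = Infinity;
  let left = 0;
  for (let right = 0; right < events.length; right++) {
    if (counts[events[right].word]++ === 0) covered++;
    // Shrink from the left while every word is still inside the window.
    while (covered === lists.length) {
      const span = events[right].pos - events[left].pos + 1;
      best = Math.min(best, span - lists.length);
      if (--counts[events[left].word] === 0) covered--;
      left++;
    }
  }
  return best;
}

// "hello extra words world more extra words nodejs":
// nodejs is at position 8, hello at 1, world at 4.
console.log(minExtraWords([[8], [1], [4]])); // 5
```

The sliding window visits each position at most twice after sorting, so the cost is dominated by the sort.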
EDIT:
After some processing, I generated the following structure for every document:
{
pageno: [
[positions of word 1],
[positions of word 2],
[positions of word n]
]
}
This makes sure to include only those pages which have all the words present.
For example:
{
1 : [
[1, 5, 6],
[2, 41],
[3, 7, 11]
],
2 : [
[1, 5, 6],
[2, 41],
[3, 7, 11]
]
}
What I need now is to count the occurrences of the query text on a particular page. Using the position arrays above, an occurrence is a choice of one position per query word such that the difference between the positions is less than the proximity value.
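One possible reading of this, sketched as brute force: count every combination of one position per word whose largest and smallest positions differ by less than the proximity value. The names `countProximityMatches` and `countMatchesPerPage` are hypothetical, and the strict "less than" comparison is taken from the sentence above; if an inclusive limit is intended, the comparison would need to change.

```javascript
// Hypothetical helper: counts combinations of one position per query word
// whose span (max position - min position) is strictly less than `proximity`.
// Brute force: fine for short position lists, potentially slow for long ones.
function countProximityMatches(lists, proximity) {
  if (lists.length === 0) return 0;
  let count = 0;
  // Depth-first walk choosing one position per word, tracking min and max so far.
  function visit(wordIdx, min, max) {
    if (max - min >= proximity) return; // already too far apart, prune this branch
    if (wordIdx === lists.length) {
      count++;
      return;
    }
    for (const pos of lists[wordIdx]) {
      visit(wordIdx + 1, Math.min(min, pos), Math.max(max, pos));
    }
  }
  visit(0, Infinity, -Infinity);
  return count;
}

// Applies the count to every page of a { pageno: [[...], [...], ...] } object.
function countMatchesPerPage(pages, proximity) {
  const counts = {};
  for (const [pageNo, lists] of Object.entries(pages)) {
    counts[pageNo] = countProximityMatches(lists, proximity);
  }
  return counts;
}
```

With the example pages above and a proximity of 4, the combinations (1, 2, 3) and (5, 2, 3) qualify, so each page would get a count of 2 under this reading.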