Exter


Implementing proximity search in a positional inverted index in Node.js

I am building a positional inverted index from text. The index structure is shown here (suggestions for improvement are welcome):

 {
     term: {
         documentID: {pageno:[positions], pageno:[positions]}, 
         documentID: {pageno:[positions]}
     }
 }
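For concreteness, here is a minimal sketch of how an index with this shape could be built. The helper name addPage, the tokenizer (lowercase, split on non-word characters) and the 0-based word positions are all assumptions for illustration, not part of my actual code. Null-prototype objects are used so that terms such as "constructor" do not collide with built-in properties.

```javascript
// Hypothetical helper: adds one page of text to an index shaped like
// { term: { documentID: { pageno: [positions] } } }.
// Assumptions: tokens are lowercased words, positions are 0-based word offsets.
function addPage(index, docId, pageNo, text) {
  const tokens = text.toLowerCase().split(/\W+/).filter(Boolean);
  tokens.forEach((term, position) => {
    if (!(term in index)) index[term] = Object.create(null);
    const docs = index[term];
    if (!(docId in docs)) docs[docId] = Object.create(null);
    const pages = docs[docId];
    if (!(pageNo in pages)) pages[pageNo] = [];
    // Tokens are visited in order, so each positions array stays sorted.
    pages[pageNo].push(position);
  });
  return index;
}

const index = Object.create(null);
addPage(index, 'doc1', 1, 'hello extra words world more extra words nodejs');
addPage(index, 'doc1', 2, 'hello world');

console.log(JSON.stringify(index.hello)); // {"doc1":{"1":[0],"2":[0]}}
console.log(JSON.stringify(index.extra)); // {"doc1":{"1":[1,5]}}
```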

I want to implement proximity search, which is defined as follows:

Queries of the type X AND Y /3 or X Y /2 are called proximity queries. The expression means: retrieve documents that contain both X and Y, 3 words or 2 words apart respectively.

Reference - Boolean Retrieval Model Using Inverted Index and Positional Index
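To make the query notation concrete, here is a rough sketch of a parser for it. The function name parseProximityQuery, the treatment of an uppercase AND as an operator, and the lowercasing of terms are assumptions for illustration.

```javascript
// Hypothetical parser for queries such as "X AND Y /3" or "X Y /2".
// Returns the search terms and the distance given after the slash
// (null when the query has no trailing "/k").
function parseProximityQuery(query) {
  const tokens = query.trim().split(/\s+/).filter(Boolean);
  let distance = null;
  const last = tokens[tokens.length - 1];
  const match = last ? /^\/(\d+)$/.exec(last) : null;
  if (match) {
    distance = Number(match[1]);
    tokens.pop(); // drop the "/k" token so it is not treated as a term
  }
  // Assumption: an uppercase AND is an operator, not a search term.
  const terms = tokens.filter((t) => t !== 'AND').map((t) => t.toLowerCase());
  return { terms, distance };
}

console.log(parseProximityQuery('X AND Y /3')); // { terms: [ 'x', 'y' ], distance: 3 }
console.log(parseProximityQuery('X Y /2'));     // { terms: [ 'x', 'y' ], distance: 2 }
```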

I want to implement this in Node.js for two or more words, but I am not sure how to go about it.

I thought of creating a search result object for each word from the index. This would have a structure like:

word: { document1: { page1: [positions], page2: [positions] } }

and then somehow compare the positions on every intersecting page for all the words and calculate the proximity.
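A rough sketch of the "intersecting pages" step, assuming the index shape described above: for one document, keep only the pages on which every query word occurs and collect one positions array per word. The helper name pagesWithAllTerms and the sample data are made up for illustration.

```javascript
// Hypothetical helper: returns { pageNo: [positionsOfTerm1, positionsOfTerm2, ...] }
// for the pages of one document that contain every query term.
function pagesWithAllTerms(index, docId, terms) {
  // Per-term page maps for this document ({} when the term or document is missing).
  const perTerm = terms.map((t) => (index[t] && index[t][docId]) || {});
  const result = {};
  for (const pageNo of Object.keys(perTerm[0] || {})) {
    // A page qualifies only if every term has a positions array on it.
    if (perTerm.every((pages) => Array.isArray(pages[pageNo]))) {
      result[pageNo] = perTerm.map((pages) => pages[pageNo]);
    }
  }
  return result;
}

// Sample data (invented): term -> document -> page -> positions.
const sampleIndex = {
  hello: { doc1: { 1: [0], 2: [4] } },
  world: { doc1: { 1: [3] }, doc2: { 1: [7] } },
  nodejs: { doc1: { 1: [7], 2: [0] } },
};

const hits = pagesWithAllTerms(sampleIndex, 'doc1', ['nodejs', 'hello', 'world']);
console.log(JSON.stringify(hits)); // {"1":[[7],[0],[3]]}
```

Page 2 is dropped because "world" does not occur on it.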

For the search query "nodejs hello world", the proximity in the string 'hello extra words world more extra words nodejs' would be 5: the extra words between the query words are counted and summed, regardless of the order of the search words. See the question "Lucene Proximity Search for phrase with more than two words".
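To pin down that definition, here is a sketch that computes the smallest such proximity from one sorted positions array per query word. It assumes proximity means the number of non-query words inside the smallest window containing every query word, i.e. (maxPos - minPos + 1) - numberOfWords. That is my reading of the example, not necessarily how Lucene computes slop. The name minProximity is hypothetical, and the technique is a plain sliding window over the merged positions.

```javascript
// Assumed definition: proximity = extra words inside the smallest window
// that contains at least one occurrence of every query word, in any order.
function minProximity(positionLists) {
  // Merge all positions into one list, tagged with the word they belong to.
  const events = [];
  positionLists.forEach((list, term) => {
    list.forEach((pos) => events.push({ pos, term }));
  });
  events.sort((a, b) => a.pos - b.pos);

  const counts = new Array(positionLists.length).fill(0);
  let covered = 0; // how many distinct words the current window contains
  let left = 0;
  let best = Infinity;
  for (let right = 0; right < events.length; right++) {
    if (counts[events[right].term]++ === 0) covered++;
    // Shrink from the left while the window still contains every word.
    while (covered === positionLists.length) {
      const span = events[right].pos - events[left].pos + 1;
      best = Math.min(best, span - positionLists.length);
      if (--counts[events[left].term] === 0) covered--;
      left++;
    }
  }
  return best; // Infinity when some word has no positions
}

const words = 'hello extra words world more extra words nodejs'.split(' ');
const query = ['nodejs', 'hello', 'world'];
// One positions array per query word (0-based word offsets).
const lists = query.map((q) =>
  words.reduce((acc, w, i) => (w === q ? acc.concat(i) : acc), [])
);
console.log(minProximity(lists)); // 5
```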

  1. Is this an efficient index structure? If yes, how do I compare the positions on every intersecting page for all the words?
  2. If the query is "jakarta apache lucene"~3 (3 being the maximum proximity allowed) and the text is 'jakarta jakarta apache lucene', will it match twice?

EDIT:

After some processing, I now generate the following structure for every document:

{
    pageno: [
        [positions of word 1],
        [positions of word 2],
        [positions of word n]
    ]
}

This ensures that only pages containing all the query words are included.

For example:

{
    1 : [
        [1, 5, 6],
        [2, 41],
        [3, 7, 11]
    ],
    2 : [
        [1, 5, 6],
        [2, 41],
        [3, 7, 11]
    ]
}

What I need to do now is find the total number of occurrences of the query text on a particular page, using the positions of the query words in the arrays above, such that the difference between their positions is less than the proximity value.
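One possible way to do this counting, as a brute-force sketch under stated assumptions: an "occurrence" is any choice of one position per query word whose window contains at most maxProximity extra words, i.e. (maxPos - minPos + 1) - numberOfWords <= maxProximity. Both that counting rule and the name countMatches are assumptions. The query words are assumed to be distinct, and the search is exponential in the worst case, with pruning as soon as a partial pick is already too wide.

```javascript
// Counts the ways of picking one position per query word so that the picked
// window holds at most maxProximity non-query words (assumed counting rule).
function countMatches(positionLists, maxProximity) {
  const n = positionLists.length;
  if (n === 0) return 0;
  const maxSpan = maxProximity + n; // widest allowed window, in words
  let count = 0;

  function pick(termIdx, lo, hi) {
    if (termIdx === n) {
      count++;
      return;
    }
    for (const pos of positionLists[termIdx]) {
      const newLo = Math.min(lo, pos);
      const newHi = Math.max(hi, pos);
      // Prune: a window that is already too wide can never shrink.
      if (newHi - newLo + 1 <= maxSpan) pick(termIdx + 1, newLo, newHi);
    }
  }

  pick(0, Infinity, -Infinity);
  return count;
}

// Per-page structure from the example above.
const pages = {
  1: [[1, 5, 6], [2, 41], [3, 7, 11]],
  2: [[1, 5, 6], [2, 41], [3, 7, 11]],
};
const totals = {};
for (const [pageNo, lists] of Object.entries(pages)) {
  totals[pageNo] = countMatches(lists, 3);
}
console.log(totals); // { '1': 5, '2': 5 }
```

Under this rule, countMatches([[0, 1], [2], [3]], 3) returns 2 for the text 'jakarta jakarta apache lucene', because each "jakarta" forms its own window. Whether that should count as one match or two depends on the counting rule chosen.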

