yhussain
yhussain

Reputation: 308

Algorithm for searching the shortest length between words

I'm trying to come up with an algorithm for finding the shortest distance between a list of words. I have a dictionary of lists that shows different locations in a document a word is found.

matches{"the" : [2, 24, 15], "is" : [5, 13], "apple" : {45} ... }

Are there established algorithms for finding the shortest length where all of these overlap? For example, in this one 13-45 would be the answer because all of the words can be found in that range.

Upvotes: 0

Views: 67

Answers (1)

ganbustein
ganbustein

Reputation: 1583

I'd keep two positions, left and right which are, respectively, the left and right ends of a range that contains all the words. I'd also maintain a priority queue, each entry of which is a word and the list of places the word occurs at or after the current left edge.

To initialize, make a new empty priority queue, insert each word with it's full list of occurrences, properly sorted. As you insert each word, update right so it's the maximum first occurrence of any word. For your data, the initial setup would be

left=2,right=45,queue=[["the", [2,15,24]], ["is", [5, 13], ["apple", [45]]

I'm showing the priority queue as if it were an array, sorted by the first component of its second component. That is, in the order 2 (for "the"), 5 (for "is"), and 45 (for "apple"). Note that the occurrences for "the" had to be sorted during this initialization. right turns out to be 45, the maximum of 2, 5, and 45.

left is implicit. It's always the first occurrence of whatever's at the front of the priority queue. At this point the shortest range we've found is 2..45.

Then repeat the following loop:

remove the first entry from the priority queue
shift its next occurrence into `left`
check if left..right is a new shortest sequence
if we've shifted off the last occurrence for this entry
    stop
otherwise, 
    update `right` to include this new next occurrence
    insert the entry back into the priority queue

With your data, the successive values would be:

left=2,right=45,queue=[["the", [2,15,24]], ["is", [5, 13], ["apple", [45]]
left=5,right=45,queue=[["is", [5, 13], ["the", [15,24]], ["apple", [45]]
left=13,right=45,queue=[["is", [13], ["the", [15,24]], ["apple", [45]]

and we terminate, because after popping ["is", [13]] from the queue and shifting 13 off its list of occurrences, none remain.

Upvotes: 1

Related Questions