Dominic Farolino
Dominic Farolino

Reputation: 1383

Implementing Basic Efficient "Search" Algorithm: Python

Mock Search Algorithm

Here is my problem:

In a given document(string) such as:

"hello there my name is dominic and my name is very special"

and searchTerms(list) such as:

['my','dominic'] or ['dominic','my'] (shouldn't matter)

an algorithm would return the shortest excerpt of the document containing the terms:

>>> 'dominic and my'

because

>>> 'my name is dominic'

contains more words than the previous.

Here is my current idea:

Start the algorithm off by creating a list with a number of inner list elements equal to the number of search terms. Inside each number of list would be indices in which that search element appears in the document.

document = document.split();

So searchTerms = ['my', 'dominic'] would return

                 [[2,7], [5]] 

because my appears at index 2 and 7 and dominic appears only at 5.

The algorithm would then take this list and generate a list of all posibilities:

                 [[2,5], [7,5]]

As you can see there are two substrings in the document string that contain both my and dominic. Then the algo could take the range of both inner lists my doing a max()-min() This would tell me the second exerpt of document is smaller than the first and could then return document[5:(7+1)] which would be the expected outcome.

For my idea this is what I have so far:

document = "hello there my name is dominic and my name is very special"
searchTerms = ['my', 'dominic']
def answer(document, searchTerms):
    index = []
    document = document.split()
    for a in range(0, len(searchTerms)):
        index.append([i for i, x in enumerate(document) if x == searchTerms[a]])
    return index

As of now this returns [[2,7],[5]] however I am running into mainly one issue:

- Efficiency:

Is this solution efficient for a document string and search terms list that is extremely large? And if not what could be done to make it more efficient or is my original idea not good

I appreciate any insight into solving this issue, thank you.

Upvotes: 0

Views: 1211

Answers (1)

Jarlax
Jarlax

Reputation: 1576

Your algorithm can perform really slow on mediocre inputs. Imagine that you have 10 search terms and text consisting of 10000 words. In this case it's possible that for each term you will have list of 1000 indexes. This will end with generation of 1000^10 total possibilities.

In terms of big-O notation complexity is O((n/k)^k) where n is number of terms in text, k - number of search terms.

Here is idea faster algorithm. While iterating document word by word we need to keep track of closest search term indexes to the current position. Let's call this structure lookup (simple python's dict). Quick example:

"hello there my name is dominic and >my< name is very special"

Suppose we are about to visit "my" word which is highlighted. At this moment lookup is {"my": 2, "dominic": 5}. Current "my" will be closer to any further word in text. So when visiting next word ("name") we will have updated version {"my": 7, "dominic": 5}. It's easy to see that optimal solution corresponds to one of the states of lookup. So to get an answer just keep track of max()-min() of values in the dictionary. Note: you should start keeping track only when all search terms will be present as keys of lookup.

At each occurrence of search term we need to iterate k values from positions lookup, so complexity of this algorithm is O(nk).

To make it even better, you can additionally use balanced BST with indexes from lookup. Now instead of iterating lookup values (O(k)) you can retrieve min index in O(logk):

min_index = bst.min()
old_index=lookup[curr_term] # O(1)
bst.delete(old_index) # O(logk)
bst.insert(new_index) # O(logk)

In this case total complexity will be O(nlogk).

Edit. The code without tree optimization (haven't found built-in BST in Python):

document = "hello there my name is dominic and my name is very special"
searchTerms = { 'my', 'dominic' } # set has faster lookups

doc_words = document.split()

from sys import maxint
def search(words, terms):
    found_terms = [[i,x] for i,x in enumerate(words) if x in terms]
    lookup = {}
    cnt = maxint
    k = len(terms)
    start,end=-1,-1
    for i,w in found_terms:
        lookup[w] = i 
        if k == len(lookup):
            min_idx = min(lookup.values())
            curr = i - min_idx
            if curr < cnt:
                cnt,start,end = curr,min_idx,i
    return words[start:end + 1]

Upvotes: 1

Related Questions