data-oil
data-oil

Reputation: 65

Efficient and fastest way to search in a list of strings

The following function return the number of words from a list that contain the exact same characters as the word entered. The order of the characters in the words is not important. However, say there is a list that contain millions of words. What is the most efficient and fastest way to perform this search?

Example:

words_list = ['yek','lion','eky','ekky','kkey','opt'];

if we were to match the word "key" with the words in the list, the function only return "yek" and "eky" since they share the same exact characters with "key" regardless of the order.

Below is the function I wrote

def find_a4(words_list, word):
    # all possible permutations of the word that we are looking for
    # it's a set of words 
    word_permutations = set([''.join(p) for p in permutations(word)])
    word_size = len(word)
    count = 0

    for word in word_list:
        # in the case of word "key", 
        # we only accept words that have 3 characters 
        # and they are in the word_permutations 
        if len(word) == word_size and word in word_permutations:
            count += 1

    return count

Upvotes: 3

Views: 1491

Answers (2)

fferri
fferri

Reputation: 18950

A dictionary whose key is the sorted version of the word:

word_list = ['yek','lion','eky','ekky','kkey','opt']

from collections import defaultdict
word_index = defaultdict(set)

for word in word_list:
    idx = tuple(sorted(word))
    word_index[idx].add(word)

# word_index = {
#    ('e', 'k', 'y'): {'yek', 'eky'},
#    ('i', 'l', 'n', 'o'): {'lion'},
#    ('e', 'k', 'k', 'y'): {'kkey', 'ekky'},
#    ('o', 'p', 't'): {'opt'}
# }

Then for querying you would do:

def find_a4(word_index, word):
    idx = tuple(sorted(word))
    return len(word_index[idx])

Or if you need to return the actual words, change it to return word_index[idx].

Efficiency: querying runs in average in O(1) time.

Upvotes: 5

Aritesh
Aritesh

Reputation: 2103

For large string, you will have n! permutations to search. I will sort all the strings before comparison, this will be nlog(n), and will sort and compare only when lengths match -

def find_a4(words_list, word):
    word = ''.join(sorted(word))
    word_size = len(word)
    count = 0
    for word1 in words_list:
        if len(word1) == word_size:
            if word == ''.join(sorted(word1)):
                count += 1
    return count 

Upvotes: 2

Related Questions