monty
monty

Reputation: 81

Quickly search all elements in two lists

Say I have two large lists, list_of_A_objects which contains objects of class A, and list _of_B_objects which contains objects of class B.

They both have string members.

I want to be able to search through all the elements in the two lists and if the string member of an A object is a sub-string of the string member of a B object I want it to do something.

What I've got below is fine if the lists are quite small, but if the lists are large it can take a long time.

Is there a way to make this faster. I've been thinking about using dictionaries in some way because they have fast lookups but I cant figure it out.

This is what I have so far.

class A:
    def __init__(self, x):
        self.string = x

class B:
    def __init__(self,x):
        self.string = x

list_of_A_objects = get_large_list_of_A_objects()

list_of_B_objects = get_large_list_of_B_objects() 


for A_object in list_of_A_objects:
    for B_Object in list_of_B_objects:
        if A_object.string in B_Object.string:
            do_something()

Upvotes: 4

Views: 295

Answers (2)

FuzzyTree
FuzzyTree

Reputation: 32392

Here's a python implementation using a dictionary. First convert one of the lists into a indexed by its object's strings

a_map = {}

for A_object in list_of_A_objects:
    a_map[A_object.string] = A_object

Then for each object in the other list, check whether the object's string exists in the dictionary (in constant time), and if so do_something

for B_object in list_of_B_objects:
    if B_object.string in a_map:
        do_something(a_map[B_object.string])

This assumes that each A_object has a unique string. If that's not the case then you can make the values of a_map an array of objects instead of a single object.

Upvotes: 0

Jim Mischel
Jim Mischel

Reputation: 133975

One thing you can do is create a single string from the B objects. While building that, you also create a list of indexes, so you know the index of each string in the larger string. See the code below.

Note that I'm not a python programmer, so you'll have to interpret my pseudo-code.

BStrings = ""
list_of_Indexes = new list of int
for B_object in list_of_B_objects
    list_of_Indexes.Add(length of BStrings)
    BStrings = BStrings + B_Object.string + newline

Now, you can search the BStrings string for each A_object. If the string is found, the function returns the index of where it was found in the string. You can then binary search the list_of_indexes to determine which B_object contains that string.

This doesn't really change the complexity of the operation (it's still MxN, where M is the number of objects in the A list, and N is the length of the B list), but searching a single string for substrings will be faster than looping over the B list because it avoids the overhead of setting up the search.

If even that is too slow, then you'll want to use something like the Aho-Corasick string matching algorithm. There's probably a decent Python implementation available.

Upvotes: 2

Related Questions