user0

Reputation: 195

Performance of checking "expanded list" equality

I need to check whether a given list is equal to the result of substituting some lists for some elements in another list. Concretely, I have a dictionary, say f = {'o': ['a', 'b'], 'l': ['z'], 'x': ['y']} and a list list1 = ['H', 'e', 'l', 'l', 'o'], so I want to check if some list2 is equal to ['H', 'e', 'z', 'z', 'a', 'b'].

Below, I first write a function apply to compute the image of list1 under f; then it suffices to write list2 == apply(list1, f). Since this check will be performed thousands of times in my program, I need it to be very fast. Therefore, I wrote the second function below, which I expected to be faster but which turns out not to be. So my questions (detailed below) are: why? And is there a faster method?

First function:

def apply(l, f):
    result = []
    for x in l:
        if x in f:
            result.extend(f[x])
        else:
            result.append(x)
    return result
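
For example, with the data above:

f = {'o': ['a', 'b'], 'l': ['z'], 'x': ['y']}
list1 = ['H', 'e', 'l', 'l', 'o']
print(apply(list1, f))  # ['H', 'e', 'z', 'z', 'a', 'b']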

Second function:

def apply_equal(list1, f, list2):
    i = 0
    for x in list1:
        if x in f:
            sublist = f[x]
            length = len(sublist)
            if list2[i:i + length] != sublist:
                return False
            i += length
        else:
            if i >= len(list2) or list2[i] != x:  # guard against list2 running out
                return False
            i += 1
    return i == len(list2)

I thought the second method would be faster since it does not first construct the image of list1 under f and then check equality with list2. Instead, it checks equality "on the fly", without building a new list. So I was surprised to see that it is not faster (and is even a bit slower). For the record: list1, list2, and the lists that are values in the dictionary are all small (typically under 50 elements), as is the number of keys in the dictionary.

So my questions are: why isn't the second method faster? And are there ways to do this faster (possibly using other data structures)?

Edit in response to the comments: list1 and list2 will most often be different, but f may be common to some of them. Typically, there could be around 100,000 checks in batches of around 50 consecutive checks with a common f. The elements of the lists are short strings. It is expected that all checks return True (so the whole lists have to be iterated over).
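
To make the comparison concrete, here is the kind of harness I use; the data below is synthetic and only roughly shaped like the real thing (all sizes are made up), and it reuses the two functions above:

import random
import string
import timeit

random.seed(0)
words = [''.join(random.choices(string.ascii_lowercase, k=3)) for _ in range(200)]
f = {w: random.sample(words, 3) for w in words[:20]}  # ~20 keys, short list values
list1 = random.choices(words, k=50)
list2 = apply(list1, f)  # so every check returns True, as in the real workload

print(timeit.timeit(lambda: list2 == apply(list1, f), number=100_000))
print(timeit.timeit(lambda: apply_equal(list1, f, list2), number=100_000))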

Upvotes: 2

Views: 77

Answers (2)

Kelly Bundy

Reputation: 27599

Without proper data for benchmarking it's hard to say, so I tested various solutions with various "sizes" of data.

  1. Replacing result.extend(f[x]) with result += f[x] always made it faster.

  2. This was faster for longer lists (using itertools.chain):

    from itertools import chain

    list2 == list(chain.from_iterable(map(f.get, list1, zip(list1))))

     Here zip(list1) supplies the 1-tuple (x,) as the default argument to f.get, so elements missing from f pass through unchanged.
    
  3. If the data allows it, explicitly storing all possible keys and always accessing with f[x] would speed it up. That is, set f[k] = [k] for all "missing" keys in advance, so you don't have to check with in or use get. (See the sketch after this list.)
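
For illustration, here is a sketch combining points 2 and 3 (all_elements is a hypothetical name for the set of every element that can occur in an input list):

from itertools import chain

# all_elements is a hypothetical stand-in for the universe of possible elements.
g = {x: f.get(x, [x]) for x in all_elements}

def apply_equal_fast(list1, g, list2):
    # every element is now a key, so no `in` check or .get default is needed
    return list2 == list(chain.from_iterable(map(g.__getitem__, list1)))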

Upvotes: 1

Julien Palard

Reputation: 11546

You need to use profiling tools like scalene to see what's slow in your code; don't try to guess.

In case you want to read it, I was able to produce an even slower version based on your idea of stopping as soon as possible, while keeping the readable first apply implementation:

def apply(l, f):
    # Generator version: lazily yields the expanded elements one by one.
    for x in l:
        if x in f:
            yield from f[x]
        else:
            yield x


def apply_equal(l1, f, l2):
    # Compare lazily, element by element; strict=True also checks the lengths.
    return all(left == right for left, right in zip(apply(l1, f), l2, strict=True))

Beware: it needs Python 3.10 or later for zip's strict=True.
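
Note also that with strict=True a length mismatch raises ValueError instead of making apply_equal return False; if you want a plain boolean in that case, one possible wrapper (just a sketch) is:

def apply_equal_bool(l1, f, l2):
    # zip(strict=True) raises ValueError when the lengths differ,
    # so translate that into False
    try:
        return apply_equal(l1, f, l2)
    except ValueError:
        return False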

As the comments said, speed depends heavily on your data here: constructing the whole list may be faster on small datasets, but halting early may win on bigger lists.

Upvotes: 1
