Peter de Rivaz

Reputation: 33509

Comparing data flow graphs quickly in Python

Background

I want to prove that two data processing functions are functionally equivalent. To do this I built an abstract syntax tree for each function and then ran a simulation of the control flow to construct a data flow graph of the final output.

The data flow graph is represented by nested 3-tuples of (operation, operand_a, operand_b), so that (a+b)*2 is represented as:

('*',('+','a','b'),'2')

Some of my operations are commutative, so an equivalent graph might instead be:

('*','2',('+','b','a'))

Question

How can I check that my two data flow graphs are isomorphic (i.e. perform exactly the same operations)?

What I've tried

My idea was to turn each data flow graph into a canonical form by detecting commutative operators and placing their operands into a sorted order (e.g. lexicographical order, although I don't think the particular order matters as long as I am consistent). I can then compare the reordered tuples for strict equality.

For my graphs I think this algorithm should be sufficient, even though it would not spot equivalences such as (a+b)+c versus a+(b+c).
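
Concretely, a minimal sketch of the canonicalisation I have in mind (assuming every node is either a leaf string or an (op, left, right) 3-tuple, and that only '+' and '*' are commutative) would be something like:

COMMUTATIVE = {'+', '*'}

def canonicalize(node):
    # Recursively rebuild the graph, sorting the operands of commutative
    # operators into a consistent (repr-based) order.
    if not isinstance(node, tuple):
        return node
    op, a, b = node
    a, b = canonicalize(a), canonicalize(b)
    if op in COMMUTATIVE and repr(b) < repr(a):
        a, b = b, a
    return (op, a, b)

Two graphs would then count as equivalent when canonicalize(G1) == canonicalize(G2).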

However, I ran into an efficiency problem even for small graphs.

For example, this Python code constructs a trivial data flow graph with just 28 operations, but it takes Python about 8 seconds to compare the resulting tuples (and larger graphs get exponentially worse):

from time import time

def make_dataflow_graph(n):
    # Build a degenerate graph of n '+' operations in which both operands
    # of every node are the same shared object.
    A = 'y'
    for i in range(n):
        A = ('+', A, A)
    return A

G1 = make_dataflow_graph(28)
G2 = make_dataflow_graph(28)

t0 = time()
print(G1 < G2)
print(time() - t0)

I suppose the problem is that Python does the comparison recursively, so it wastes lots of time comparing the same shared nodes over and over again.
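
To put numbers on that suspicion (the two counting helpers below are mine, purely for illustration), the graph from make_dataflow_graph(n) contains only n + 1 distinct objects, but a naive recursive traversal visits 2**(n+1) - 1 nodes:

def naive_node_count(node):
    # Visit every node along every path, as a deep comparison would.
    if not isinstance(node, tuple):
        return 1
    return 1 + naive_node_count(node[1]) + naive_node_count(node[2])

def distinct_node_count(node, seen=None):
    # Count each object once, following the sharing in the graph.
    seen = set() if seen is None else seen
    if id(node) in seen:
        return 0
    seen.add(id(node))
    if not isinstance(node, tuple):
        return 1
    return 1 + distinct_node_count(node[1], seen) + distinct_node_count(node[2], seen)

G = make_dataflow_graph(20)
print(naive_node_count(G))     # 2097151
print(distinct_node_count(G))  # 21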

Is there some Pythonic way to make a tuple comparison more efficient, or is there some better algorithm for comparing my data flow graphs?

Upvotes: 1

Views: 441

Answers (1)

David Eisenstat

Reputation: 65498

I can't speak to whether it's idiomatic Python, but you could borrow an idea from Lisp called hash consing. In a sentence, the idea is to use a hash table to maximise subobject sharing, so that tuples can be compared by identity instead of deeply.

Something like this:

# Objects whose ids are recorded here are already in canonical form.
canonical_ids = set()
# Interning tables: one for leaves, one for tuples keyed by the ids of
# their (already canonical) fields.
canonical_objs = {}
canonical_tuples = {}

def canonical_object(obj):
    if id(obj) in canonical_ids:
        # Already canonical; return it unchanged.
        return obj
    if isinstance(obj, tuple):
        # Canonicalize the three fields, then normalise the operand order
        # of commutative operators by id.
        operator, left_operand, right_operand = map(canonical_object, obj)
        if operator in {'+', '*'} and id(left_operand) > id(right_operand):
            left_operand, right_operand = right_operand, left_operand
        obj = operator, left_operand, right_operand
        # Reuse an existing tuple with the same canonical fields if one
        # has been seen before.
        canon_obj = canonical_tuples.setdefault(tuple(map(id, obj)), obj)
    else:
        canon_obj = canonical_objs.setdefault(obj, obj)
    canonical_ids.add(id(canon_obj))
    return canon_obj

Then you can do things like

canonical_object(obj1) is canonical_object(obj2)
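
A usage sketch of my own (make_canonical_graph is a hypothetical counterpart of make_dataflow_graph from the question): if each node is passed through canonical_object as the graph is built, shared sub-graphs hit the identity fast path and the final comparison is a single identity check:

def make_canonical_graph(n):
    # Hypothetical mirror of the question's make_dataflow_graph, interning
    # every node through canonical_object as it is constructed.
    A = canonical_object('y')
    for i in range(n):
        A = canonical_object(('+', A, A))
    return A

G1 = make_canonical_graph(28)
G2 = make_canonical_graph(28)
print(G1 is G2)  # True: both graphs intern to the same object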

Upvotes: 2
