Raising performance of BFS in Python

Question

How can I improve the speed of the Python code below?

My code runs without errors, but it is very slow.

The input data is the Facebook Large Page-Page Network dataset, which is available here: http://snap.stanford.edu/data/facebook-large-page-page-network.html

Problem definition:

Check if the distance between two nodes is less than max_distance

My constraints:

I have to import a .txt file whose format is like sample_input
The expected output is like sample_output
Total code runtime should be less than 5 secs.

Can anyone give me advice on how to improve my code? My code follows:

from collections import deque

class Graph:
    def __init__(self, filename):
        self.filename = filename
        self.graph = {}
        with open(self.filename) as input_data:
            for line in input_data:
                key, val = line.strip().split(',')
                self.graph[key] = self.graph.get(key, []) + [val]

    def check_distance(self, x, y, max_distance):          
        dist = self.path(x, y, max_distance)
        if dist:
            return dist - 1 <= max_distance
        else:
            return False

    def path(self, x, y, max_distance):
        start, end = str(x), str(y)
        queue = deque([[start]])  # each queue entry is a path, i.e. a list of nodes
        while queue:
            path = queue.popleft()
            node = path[-1]
            if node == end:
                return len(path)
            elif len(path) > max_distance:
                return False
            else:
                for adjacent in self.graph.get(node, []):
                    queue.append(list(path) + [adjacent])

Thank you for your help in advance.

MindOfMetalAndWheels · Accepted Answer

Several pointers:

If you call check_distance more than once, the graph gets rebuilt on every call; build it once when the object is created and reuse it.
Calling queue.pop(0) is inefficient on a standard list in Python; use a deque from the collections module instead.
As DarrylG points out, you can exit from the BFS early once a path has exceeded the max distance.

You could try:

from collections import deque

class Graph:
    def __init__(self, filename):
        self.filename = filename
        self.graph = self.file_to_graph()

    def file_to_graph(self):
        graph = {}
        with open(self.filename) as input_data:
            for line in input_data:
                key, val = line.strip().split(',')
                # append in place; graph.get(key, []) + [val] copies the whole list each time
                graph.setdefault(key, []).append(val)
        return graph

    def check_distance(self, x, y, max_distance):          
        path_length = self.path(x, y, max_distance)
        if path_length:
            # path_length counts nodes, so the number of edges is one less
            return path_length - 1 <= max_distance
        else:
            return False

    def path(self, x, y, max_distance):
        start, end = str(x), str(y)
        queue = deque([[start]])  # each queue entry is a path, i.e. a list of nodes
        while queue:
            path = queue.popleft()
            node = path[-1]
            if node == end:
                return len(path)
            elif len(path) > max_distance:
                # this path already spans max_distance edges, so do not extend it;
                # equally long paths still in the queue may yet end at the target
                continue
            else:
                for adjacent in self.graph.get(node, []):
                    queue.append(list(path) + [adjacent])
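
A further variation, which is not part of the code above: since only the distance matters, the search does not need to store whole paths. The sketch below keeps a (node, distance) pair per queue entry and a visited set so that each node is expanded at most once. The names build_graph and within_distance are hypothetical, and the input is assumed to be an iterable of "a,b" lines like the file read above, treated as directed edges exactly as the original loader does.

```python
from collections import defaultdict, deque

def build_graph(lines):
    """Build an adjacency list from 'key,val' lines (directed, as in the loader above)."""
    graph = defaultdict(list)
    for line in lines:
        key, val = line.strip().split(',')
        graph[key].append(val)  # in-place append, no list copy
    return graph

def within_distance(graph, start, end, max_distance):
    """Return True if end can be reached from start using at most max_distance edges."""
    if start == end:
        return True
    visited = {start}
    queue = deque([(start, 0)])  # (node, edges used to reach it)
    while queue:
        node, dist = queue.popleft()
        if dist == max_distance:
            continue  # neighbours of this node would be too far away
        for adjacent in graph.get(node, []):
            if adjacent == end:
                return True
            if adjacent not in visited:
                visited.add(adjacent)
                queue.append((adjacent, dist + 1))
    return False

# small made-up edge list, only for illustration
edges = ["1,2", "1,3", "2,4", "3,5"]
graph = build_graph(edges)
print(within_distance(graph, "1", "5", 2))  # True
print(within_distance(graph, "1", "5", 1))  # False
```

Whether this is fast enough for the 5 second limit depends on the number of queries and on max_distance, so it is worth timing on the real file.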

As to why pop(0) is inefficient - from the docs:

Though list objects support similar operations, they are optimized for fast fixed-length operations and incur O(n) memory movement costs for pop(0) and insert(0, v) operations which change both the size and position of the underlying data representation.
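
To see the difference on your own machine, a rough comparison along these lines may help. It is only a sketch: the function names drain_list and drain_deque and the queue size are made up for illustration, and the timings will vary between machines, so no expected numbers are given.

```python
from collections import deque
from timeit import timeit

def drain_list(n):
    """Consume a list from the front; every pop(0) shifts the remaining items."""
    queue = list(range(n))
    out = []
    while queue:
        out.append(queue.pop(0))
    return out

def drain_deque(n):
    """Consume a deque from the front; popleft() does not move the other items."""
    queue = deque(range(n))
    out = []
    while queue:
        out.append(queue.popleft())
    return out

if __name__ == "__main__":
    n = 50_000  # arbitrary size chosen for the comparison
    print("list.pop(0):    ", timeit(lambda: drain_list(n), number=1))
    print("deque.popleft():", timeit(lambda: drain_deque(n), number=1))
```

Both functions return the items in the same first-in, first-out order; only the cost of removing from the front differs.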
