boralim
boralim

Reputation: 45

Raising performance of BFS in Python

How can I increase speed performance of below Python code?

My code works okay which means no errors but the performance of this code is very slow.

The input data is Facebook Large Page-Page Network dataset, you can access here the dataset: (http://snap.stanford.edu/data/facebook-large-page-page-network.html)

Problem definition:

Check if the distance between two nodes are less than max_distance

My constraints:

Can anyone give me an advice to improve my code much better? Follow my code:

from collections import deque

class Graph:
    def __init__(self, filename):
        self.filename = filename
        self.graph = {}
        with open(self.filename) as input_data:
            for line in input_data:
                key, val = line.strip().split(',')
                self.graph[key] = self.graph.get(key, []) + [val]

    def check_distance(self, x, y, max_distance):          
        dist = self.path(x, y, max_distance)
        if dist:
            return dist - 1 <= max_distance
        else:
            return False

    def path(self, x, y, max_distance):
        start, end = str(x), str(y)
        queue = deque([start])
        while queue:
            path = queue.popleft()
            node = path[-1]
            if node == end:
                return len(path)
            elif len(path) > max_distance:
                return False
            else:
                for adjacent in self.graph.get(node, []):
                    queue.append(list(path) + [adjacent])

Thank you for your help in advance.

Upvotes: 0

Views: 2400

Answers (2)

MindOfMetalAndWheels
MindOfMetalAndWheels

Reputation: 339

Several pointers:

  1. if you call check distance more than once you have to recreate the graph
  2. calling queue.pop(0) is inefficient on a standard list in python, use something like a deque from the collections module. see here
  3. as DarrylG points out you can exit from the BFS early once a path has exceed the max distance

you could try

from collections import deque

class Graph:
    def __init__(self, filename):
        self.filename = filename
        self.graph = self.file_to_graph()

    def file_to_graph(self):
        graph = {}
        with open(self.filename) as input_data:
            for line in input_data:
                key, val = line.strip().split(',')
                graph[key] = graph.get(key, []) + [val]
        return graph

    def check_distance(self, x, y, max_distance):          
        path_length = self.path(x, y, max_distance)
        if path_length:
            return len(path) - 1 <= max_distance
        else:
            return False

    def path(self, x, y, max_distance):
        start, end = str(x), str(y)
        queue = deque([start])
        while queue:
            path = queue.popleft()
            node = path[-1]
            if node == end:
                return len(path)
            elif len(path) > max_distance:
                # we have explored all paths shorter than the max distance
                return False
            else:
                for adjacent in self.graph.get(node, []):
                    queue.append(list(path) + [adjacent])

As to why pop(0) is inefficient - from the docs:

Though list objects support similar operations, they are optimized for fast fixed-length operations and incur O(n) memory movement costs for pop(0) and insert(0, v) operations which change both the size and position of the underlying data representation.

Upvotes: 1

William Prigol Lopes
William Prigol Lopes

Reputation: 1889

About the approach:

You create a graph and execute several times comparison from one to another element in your graph. Every time you run your BFS algorithm. This will create a cost of O|E+V| at every time or, you need to calculate every time the distances again and again. This is not a good approach.

What I recommend. Run a Dijkstra Algorithm (that get the minimum distance between 2 nodes and store the info on an adjacency matrix. What you will need to do is only get the calculated info inside this adjacency matrix that will contains all minimal distances on your graph and what you will need is consume the distances calculated on a previous step

About the algorithms

I recommend you to look for different approaches of DFS/BFS.

If you're looking to compare all nodes I believe that Dijkstra's Algorithm will be more efficient in your case because they mark visited paths.(https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm). You can modify and call the algo only once.

Other thing that you need to check is. Your graph contains cycles? If yes, you need to apply some control on cycles, you'll need to check about Ford Fulkerson Algorithm (https://en.wikipedia.org/wiki/Ford%E2%80%93Fulkerson_algorithm)

As I understood. Every time that you want to compare a node to another, you run again your algorithm. If you have 1000 elements on your graph, your comparison, at every time will visit 999 nodes to check this.

If you implement a Dijkstra and store the distances, you run only once for your entire network and save the distances in memory.

The next step is collect from memory the distances that you can put in an array.

You can save all distances on an adjacency matrix (http://opendatastructures.org/versions/edition-0.1e/ods-java/12_1_AdjacencyMatrix_Repres.html) and only consume the information several times without the calculation debt at every time.

Upvotes: 0

Related Questions