Reputation: 45
How can I increase speed performance of below Python code?
My code works okay which means no errors but the performance of this code is very slow.
The input data is Facebook Large Page-Page Network dataset, you can access here the dataset: (http://snap.stanford.edu/data/facebook-large-page-page-network.html)
Problem definition:
Check if the distance between two nodes are less than max_distance
My constraints:
I have to import a .txt
file of which format is like sample_input
Expected ouput is like sample_output
Totall code runtime should be less than 5 secs.
Can anyone give me an advice to improve my code much better? Follow my code:
from collections import deque
class Graph:
def __init__(self, filename):
self.filename = filename
self.graph = {}
with open(self.filename) as input_data:
for line in input_data:
key, val = line.strip().split(',')
self.graph[key] = self.graph.get(key, []) + [val]
def check_distance(self, x, y, max_distance):
dist = self.path(x, y, max_distance)
if dist:
return dist - 1 <= max_distance
else:
return False
def path(self, x, y, max_distance):
start, end = str(x), str(y)
queue = deque([start])
while queue:
path = queue.popleft()
node = path[-1]
if node == end:
return len(path)
elif len(path) > max_distance:
return False
else:
for adjacent in self.graph.get(node, []):
queue.append(list(path) + [adjacent])
Thank you for your help in advance.
Upvotes: 0
Views: 2400
Reputation: 339
Several pointers:
queue.pop(0)
is inefficient on a standard list in python, use something like a deque
from the collections module. see hereyou could try
from collections import deque
class Graph:
def __init__(self, filename):
self.filename = filename
self.graph = self.file_to_graph()
def file_to_graph(self):
graph = {}
with open(self.filename) as input_data:
for line in input_data:
key, val = line.strip().split(',')
graph[key] = graph.get(key, []) + [val]
return graph
def check_distance(self, x, y, max_distance):
path_length = self.path(x, y, max_distance)
if path_length:
return len(path) - 1 <= max_distance
else:
return False
def path(self, x, y, max_distance):
start, end = str(x), str(y)
queue = deque([start])
while queue:
path = queue.popleft()
node = path[-1]
if node == end:
return len(path)
elif len(path) > max_distance:
# we have explored all paths shorter than the max distance
return False
else:
for adjacent in self.graph.get(node, []):
queue.append(list(path) + [adjacent])
As to why pop(0) is inefficient - from the docs:
Though list objects support similar operations, they are optimized for fast fixed-length operations and incur O(n) memory movement costs for pop(0) and insert(0, v) operations which change both the size and position of the underlying data representation.
Upvotes: 1
Reputation: 1889
You create a graph and execute several times comparison from one to another element in your graph. Every time you run your BFS algorithm. This will create a cost of O|E+V|
at every time or, you need to calculate every time the distances again and again. This is not a good approach.
What I recommend. Run a Dijkstra Algorithm (that get the minimum distance between 2 nodes and store the info on an adjacency matrix. What you will need to do is only get the calculated info inside this adjacency matrix that will contains all minimal distances on your graph and what you will need is consume the distances calculated on a previous step
I recommend you to look for different approaches of DFS/BFS.
If you're looking to compare all nodes I believe that Dijkstra's Algorithm will be more efficient in your case because they mark visited paths.(https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm). You can modify and call the algo only once.
Other thing that you need to check is. Your graph contains cycles? If yes, you need to apply some control on cycles, you'll need to check about Ford Fulkerson Algorithm (https://en.wikipedia.org/wiki/Ford%E2%80%93Fulkerson_algorithm)
As I understood. Every time that you want to compare a node to another, you run again your algorithm. If you have 1000 elements on your graph, your comparison, at every time will visit 999 nodes to check this.
If you implement a Dijkstra and store the distances, you run only once for your entire network and save the distances in memory.
The next step is collect from memory the distances that you can put in an array.
You can save all distances on an adjacency matrix (http://opendatastructures.org/versions/edition-0.1e/ods-java/12_1_AdjacencyMatrix_Repres.html) and only consume the information several times without the calculation debt at every time.
Upvotes: 0