Reputation: 409
I am trying to perform the following operation in Python but am running into memory and time issues. Since I am very new to Python, guidance on how and where to optimize would be appreciated! (I understand that the following question is somewhat abstract.)
import networkx as nx
dic_score = {}
G = nx.watts_strogatz_graph(10000,10,.01) # Generate 2 graphs with 10,000 nodes using Networkx
H = nx.watts_strogatz_graph(10000,10,.01)
for Gnodes in G.nodes():
    for Hnodes in H.nodes():  # i.e. for every pair of nodes across the two graphs
        score = some_operation(Gnodes, Hnodes)  # placeholder: calculate a metric for the pair
        # Store the metric as key: value, where the value is a list of [Hnodes, score, -1] lists
        dic_score.setdefault(Gnodes, []).append([Hnodes, score, -1])
Then sort the lists in the generated dictionary according to the criterion described here: sorting_criterion
My problems/questions are:
1) Is there a better way of approaching this than using the for loops for iteration?
2) What is the fastest way to approach this problem? Should I consider a data structure other than a dictionary, or possibly file operations?
3) Since I need to sort the lists inside this dictionary, which has 10,000 keys each corresponding to a list of 10,000 values, the memory requirements become huge very quickly and I run out of memory. Is there a way to integrate the sorting into the construction of the dictionary itself, i.e. avoid a separate loop for sorting?
Any input would be appreciated! Thanks!
Upvotes: 2
Views: 786
Reputation: 36513
1) You can use one of the functions from the itertools module for that. I'll just mention it here; you can read the manual or call:
from itertools import product
help(product)
Here's an example:
for item1, item2 in product(list1, list2):
    pass
2) If the results are too big to fit in memory, try saving them somewhere. For example, you can write them to a CSV file:
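import csv
# Open for writing (Python 3; on Python 2 use open('result.csv', 'wb') instead)
with open('result.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile, dialect='excel')
    for ...:
        writer.writerow(...)  # one row per result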
This keeps the results out of memory, since each row is written as soon as it is computed.
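A minimal runnable sketch of this idea, combining itertools.product with csv.writer. The score function, the write_scores helper, and the small ranges standing in for the graph nodes are all made up for illustration; plug in your own metric and node lists.

```python
import csv
import os
import tempfile
from itertools import product


def score(g_node, h_node):
    # Hypothetical placeholder metric; replace with the real calculation.
    return abs(g_node - h_node)


def write_scores(g_nodes, h_nodes, path):
    """Write one CSV row per (G node, H node) pair and return the row count."""
    count = 0
    with open(path, 'w', newline='') as outfile:
        writer = csv.writer(outfile, dialect='excel')
        for g_node, h_node in product(g_nodes, h_nodes):
            # Each row goes to disk right away instead of into a dictionary.
            writer.writerow([g_node, h_node, score(g_node, h_node)])
            count += 1
    return count


# Small ranges stand in for G.nodes() and H.nodes().
out_path = os.path.join(tempfile.gettempdir(), 'result.csv')
rows_written = write_scores(range(3), range(4), out_path)
```

Only one row is alive at a time here, so memory use no longer grows with the number of pairs.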
3) I think it's better to sort the resulting data afterwards (the sort function is rather quick) than to complicate matters by sorting the data on the fly.
You could instead use NumPy array/matrix operations (sums, products, or even mapping a function over each matrix row). These are so fast that filtering the data sometimes costs more than calculating everything.
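As a rough sketch of the NumPy route: this assumes the metric can be computed from one number per node (here a made-up feature such as the node degree) and uses the absolute difference as a stand-in score. The arrays g_feat and h_feat are hypothetical; whether the real metric vectorizes like this depends on what it actually is.

```python
import numpy as np

# Hypothetical per-node features (e.g. node degrees); the real metric may differ.
g_feat = np.array([3, 1, 4])
h_feat = np.array([2, 7, 1, 8])

# Broadcasting builds the full len(G) x len(H) score matrix without Python loops.
scores = np.abs(g_feat[:, None] - h_feat[None, :])

# For every G node (row), the H node indices ordered by ascending score.
order = np.argsort(scores, axis=1, kind='stable')
```

Keep in mind that a 10,000 x 10,000 array of 8-byte values takes about 800 MB, so a smaller dtype or working on blocks of rows may be needed.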
If your app is still very slow, try profiling it to see exactly what operation is slow or is done too many times:
from cProfile import Profile
p = Profile()
p.runctx('my_function(args)', {'my_function': my_function, 'args': my_data}, {})
p.print_stats()
You'll see the table:
2706 function calls (2004 primitive calls) in 4.504 CPU seconds
Ordered by: standard name
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.006    0.003    0.953    0.477 pobject.py:75(save_objects)
     43/3    0.533    0.012    0.749    0.250 pobject.py:99(evaluate)
...
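If the table gets long, the standard pstats module can sort and truncate it. A small sketch; my_function here is a made-up stand-in for whatever code you are profiling:

```python
import io
import pstats
from cProfile import Profile


def my_function(n):
    # Hypothetical stand-in for the slow code being profiled.
    return sorted(abs(i - j) for i in range(n) for j in range(n))


p = Profile()
p.enable()
result = my_function(50)
p.disable()

# Sort by cumulative time and print only the 5 most expensive entries.
stream = io.StringIO()
stats = pstats.Stats(p, stream=stream)
stats.sort_stats('cumulative').print_stats(5)
report = stream.getvalue()
print(report)
```

Sorting by cumulative time usually points at the outer function that dominates the run, while sorting by 'tottime' shows where the time is spent inside the functions themselves.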
Upvotes: 5
Reputation: 4723
Others have mentioned itertools.product. That's good, but in your case there is another possibility: a generator expression for the inner loop, combined with the sorted function. (Code untested, of course.)
import networkx as nx
from operator import itemgetter
dic_score = {}
G = nx.watts_strogatz_graph(10000,10,.01) # Generate 2 graphs with 10,000 nodes using Networkx
H = nx.watts_strogatz_graph(10000,10,.01)
for Gnodes in G.nodes():
    # score() stands for your metric; the generator expression needs its own parentheses
    dic_score[Gnodes] = sorted(([Hnodes, score(Gnodes, Hnodes), -1] for Hnodes in H.nodes()), key=itemgetter(1))  # sort on score
The inner loop is replaced by a generator expression, and each inner list is sorted as soon as it is built (assuming you want to sort each inner list on score). Instead of storing the lists in a dictionary, you could easily write each inner list to a file, which would help with memory.
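One way to avoid the dictionary altogether is to wrap the loop in a generator, so only one sorted inner list exists at a time and the caller decides what to do with it (write it to a file, keep the best few entries, and so on). A sketch with a made-up score function and a hypothetical sorted_rows helper:

```python
from operator import itemgetter


def score(g_node, h_node):
    # Hypothetical placeholder metric; replace with the real calculation.
    return abs(g_node - h_node)


def sorted_rows(g_nodes, h_nodes):
    """Yield (g_node, inner list sorted on score), one G node at a time."""
    for g_node in g_nodes:
        row = sorted(
            ([h_node, score(g_node, h_node), -1] for h_node in h_nodes),
            key=itemgetter(1),
        )
        yield g_node, row


# h_nodes must be re-iterable (a list or range), since it is scanned once per G node.
rows = list(sorted_rows(range(2), [5, 0, 2]))
```

In real use you would loop over sorted_rows(...) and handle each row immediately instead of collecting everything with list().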
Upvotes: 1
Reputation: 66729
When a function returns a list, check whether there is a counterpart that returns an iterator instead.
This will reduce memory usage.
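In your case, nx.nodes returns the complete list. See: nodes
Use nodes_iter instead, since it returns an iterator (this applies to NetworkX 1.x; nodes_iter was removed in NetworkX 2.0, where G.nodes() already returns a view instead of a list). This ensures that you do not hold the full list of nodes in memory while iterating over the nodes in your for loop.
See: nodes_iter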
Some improvement:
import networkx as nx
dic_score = {}
G = nx.watts_strogatz_graph(10000,10,.01)
H = nx.watts_strogatz_graph(10000,10,.01)
for Gnodes in G.nodes_iter():  # changed from G.nodes()
    for Hnodes in H.nodes_iter():  # changed from H.nodes()
        score = some_operation(Gnodes, Hnodes)  # placeholder for the metric calculation
        dic_score.setdefault(Gnodes, []).append([Hnodes, score, -1])
You can also use another idiom now that you have two iterators: itertools.product.
product(A, B) returns the same as ((x,y) for x in A for y in B).
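A quick check of that equivalence, with two small made-up lists:

```python
from itertools import product

A = [1, 2, 3]
B = ['x', 'y']

# Both forms produce the pairs lazily; list() is only used here to compare them.
pairs_product = list(product(A, B))
pairs_genexp = list((x, y) for x in A for y in B)
```

Note that product reads both inputs completely before it yields the first pair, so the inputs themselves are held in memory even though the pairs are produced one at a time.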
Upvotes: 4