Reputation: 409

How to optimize the memory and time usage of the following algorithm in Python

I am trying to accomplish the following logical operation in Python but am running into memory and time issues. Since I am very new to Python, guidance on how and where to optimize would be appreciated. (I do understand that the following question is somewhat abstract.)

import networkx as nx

dic_score = {}
# Generate two graphs with 10,000 nodes each using NetworkX
G = nx.watts_strogatz_graph(10000, 10, .01)
H = nx.watts_strogatz_graph(10000, 10, .01)
for Gnodes in G.nodes():
    for Hnodes in H.nodes():  # i.e. for every pair of nodes across the two graphs
        score = SomeOperation(Gnodes, Hnodes)  # calculate a metric (SomeOperation is a placeholder)
        # Store the metric as key: value, where the value is a list of lists
        dic_score.setdefault(Gnodes, []).append([Hnodes, score, -1])

Then sort the lists in the generated dictionary according to the criterion described here: sorting_criterion

My problems/questions are:

1) Is there a better way of approaching this than using for loops for iteration?

2) What would be the fastest method of approaching the problem above? Should I consider a data structure other than a dictionary, or possibly file operations?

3) Since I need to sort the lists inside this dictionary, which has 10,000 keys each corresponding to a list of 10,000 values, the memory requirements become huge quite quickly and I run out of memory.

3) Is there a way to integrate the sorting into the construction of the dictionary itself, i.e. avoid a separate loop to sort?

Any input would be appreciated. Thanks!

Upvotes: 2

Answers (3)

culebrón

Reputation: 36513

1) You can use one of the functions from the itertools module for that. I'll just mention it here; you can read the manual or call:

from itertools import product
help(product)

Here's an example:

for item1, item2 in product(list1, list2):
    pass

2) If the result is too big to fit in memory, try saving it somewhere. You can write it to a CSV file, for example:

import csv

with open('result.csv', 'w', newline='') as outfile:  # open for writing
    writer = csv.writer(outfile, dialect='excel')
    for ...:
        writer.writerow(...)  # write one row per result

This keeps memory usage low, because each row is written out as soon as it is computed instead of being held in a dictionary.
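To make the idea above concrete, here is a minimal sketch that streams every pair's score straight to a CSV file. The names `some_operation` and `write_scores` are made up for illustration, and the metric itself is only a stand-in for whatever you actually compute.

```python
import csv
from itertools import product


def some_operation(g_node, h_node):
    # Hypothetical stand-in for the real metric
    return abs(g_node - h_node)


def write_scores(g_nodes, h_nodes, path):
    """Write one (g_node, h_node, score) row per pair and return the row count."""
    rows = 0
    with open(path, "w", newline="") as outfile:
        writer = csv.writer(outfile)
        for g_node, h_node in product(g_nodes, h_nodes):
            # Each row goes to disk right away; nothing accumulates in memory
            writer.writerow([g_node, h_node, some_operation(g_node, h_node)])
            rows += 1
    return rows
```

Calling `write_scores(G.nodes(), H.nodes(), 'result.csv')` would be the intended use, assuming the node labels are integers as they are for `watts_strogatz_graph`.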

3) I think it's better to sort the result data afterwards (because the sort function is rather quick) than to complicate matters by sorting the data on the fly.

You could instead use NumPy array/matrix operations (sums, products, or even mapping a function over each matrix row). These are so fast that filtering the data sometimes costs more than calculating everything.
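As a rough illustration of the NumPy route, the sketch below assumes the metric can be written in terms of one number per node (the feature arrays here are made-up values, not anything from your graphs). Whether your real metric vectorizes like this is something only you can tell.

```python
import numpy as np

# Hypothetical per-node features, e.g. something like a node degree
g_feat = np.array([3, 1, 4, 1, 5])
h_feat = np.array([2, 7, 1, 8])

# Broadcasting builds the full table: scores[i, j] = |g_feat[i] - h_feat[j]|
scores = np.abs(g_feat[:, None] - h_feat[None, :])

# order[i] holds the H indices for G node i, sorted by ascending score
order = np.argsort(scores, axis=1, kind="stable")
```

Keep in mind that a full 10,000 x 10,000 table of 8-byte values is about 800 MB, so this trades Python-level overhead for one large but compact array.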

If your app is still very slow, try profiling it to see exactly what operation is slow or is done too many times:

from cProfile import Profile
p = Profile()

p.runctx('my_function(args)', {'my_function': my_function, 'args': my_data}, {})
p.print_stats()

You'll see a table like this:

      2706 function calls (2004 primitive calls) in 4.504 CPU seconds

Ordered by: standard name

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2    0.006    0.003    0.953    0.477 pobject.py:75(save_objects)
  43/3    0.533    0.012    0.749    0.250 pobject.py:99(evaluate)
...

Upvotes: 5

Reinstate Monica

Reputation: 4723

Others have mentioned itertools.product. That's good, but in your case, there is another possibility: a generator expression for the inner loop, and the sorted function. (Code untested, of course.)

import networkx as nx
from operator import itemgetter

dic_score = {}
# Generate two graphs with 10,000 nodes each using NetworkX
G = nx.watts_strogatz_graph(10000, 10, .01)
H = nx.watts_strogatz_graph(10000, 10, .01)
for Gnodes in G.nodes():
    # score() stands for the metric; the generator expression needs its own parentheses
    dic_score[Gnodes] = sorted(
        ([Hnodes, score(Gnodes, Hnodes), -1] for Hnodes in H.nodes()),
        key=itemgetter(1),  # sort on score
    )

The inner loop is replaced by a generator expression. It is also sorted on the fly (assuming you want to sort each inner list on score). Instead of storing in a dictionary, you could easily write each inner list to a file, which would help with memory.
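If it turns out you only need the best few matches per node (that is an assumption about your use case, not something the question states), a variation on the same idea is `heapq.nsmallest`, which consumes the generator without ever building the full inner list. The metric and the function name `best_matches` below are hypothetical.

```python
import heapq


def some_operation(g_node, h_node):
    # Hypothetical stand-in for the real metric
    return (g_node * 7 + h_node * 3) % 11


def best_matches(g_nodes, h_nodes, k):
    """Map each G node to its k lowest-scoring [h_node, score, -1] entries."""
    result = {}
    for g in g_nodes:
        # Generator expression: entries are produced one at a time
        candidates = ([h, some_operation(g, h), -1] for h in h_nodes)
        # nsmallest returns the k best entries, already sorted by score
        result[g] = heapq.nsmallest(k, candidates, key=lambda item: item[1])
    return result
```

Note that `h_nodes` has to be something you can iterate over repeatedly (a list or a range), since it is traversed once per G node. With k entries per key the dictionary holds 10,000 x k items instead of 10,000 x 10,000.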

Upvotes: 1

pyfunc

Reputation: 66729

When working with a function that returns a list, check whether there is a counterpart that returns an iterator.

This will improve memory usage.

In your case, nx.nodes returns the complete list. See: nodes

Use nodes_iter since it returns an iterator. This should ensure that you do not have the full list of nodes in memory while iterating on the nodes in your for loop.

See: nodes_iter
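To see the difference in plain Python, here is a small sketch that does not need NetworkX at all. `node_ids` is a hypothetical stand-in for an iterator-returning API such as `nodes_iter()`. As far as I know, `nodes_iter` belongs to NetworkX 1.x; from 2.0 on it was removed and `G.nodes()` returns a lightweight view rather than a list.

```python
import sys


def node_ids(n):
    # Hypothetical stand-in for an iterator-returning API such as nodes_iter()
    for i in range(n):
        yield i


as_list = list(node_ids(10000))  # every element is materialized up front
as_iter = node_ids(10000)        # elements are produced on demand

list_bytes = sys.getsizeof(as_list)  # size of the list object itself
iter_bytes = sys.getsizeof(as_iter)  # size of the generator object

total = sum(as_iter)  # the iterator still yields every element exactly once
```

The generator object stays the same small size no matter how many nodes it will yield, while the list grows with the node count.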

Some improvement:

import networkx as nx

dic_score = {}
G = nx.watts_strogatz_graph(10000, 10, .01)
H = nx.watts_strogatz_graph(10000, 10, .01)
for Gnodes in G.nodes_iter():  # changed from G.nodes()
    for Hnodes in H.nodes_iter():  # changed from H.nodes()
        score = SomeOperation(Gnodes, Hnodes)  # placeholder for the metric
        dic_score.setdefault(Gnodes, []).append([Hnodes, score, -1])

You can also use the other idiom, since you now have two iterators: itertools.product.

product(A, B) returns the same as ((x,y) for x in A for y in B).
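A quick check of that equivalence, using two small made-up lists:

```python
from itertools import product

A = [1, 2]
B = ["x", "y"]

# Both forms yield the same pairs in the same order
pairs_from_product = list(product(A, B))
pairs_from_loops = [(a, b) for a in A for b in B]
```

One thing to be aware of: `product` reads each input completely into memory before it yields the first pair, so passing it iterators saves nothing on the inputs themselves. The saving is that the pairs are generated lazily.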

Upvotes: 4
