Aleix

Reputation: 63

Python: improve performance and/or method to avoid memory errors when creating, saving and deleting variable variables

I have been fighting a function that was giving me a memory error and, thanks to your support (Python: how to split and return a list from a function to avoid memory error), I managed to sort out the issue. However, since I am not a professional programmer, I would like to ask for your opinion on my method and how to improve its performance (if possible).

The function is a generator function returning all cycles from an n-node digraph. However, for a 12-node digraph there are about 115 million cycles (each defined as a list of nodes, e.g. [0,1,2,0] is a cycle). I need all cycles available for further processing even after I have extracted some of their properties when they were first generated, so they need to be stored somewhere. The idea is therefore to cut the result array every 10 million cycles to avoid a memory error (when a single array gets too big, Python runs out of RAM) and to create a new array to store the following results. For the 12-node digraph I would then have 12 result arrays: 11 full ones (containing 10 million cycles each) and a last one containing 5 million cycles.

However, splitting the result array is not enough, since the variables stay in RAM. So I still need to write each one to disk and delete it afterwards to free the RAM.

As stated in How do I create a variable number of variables?, using exec to create variable variable names is not very "clean", and dictionary-based solutions are better. However, in my case, if I store all the results in a single dictionary it will run out of memory due to the size of the arrays, so I went for the exec way. I would be grateful if you could comment on that decision.
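For reference, the dictionary-based pattern from the linked question would look roughly as follows if each entry were still written to disk and deleted once full (a minimal sketch, not the code I am running; generator stands for my cycle generator shown below):

import numpy as np

results = {}  # maps a chunk index to its list of cycles
chunk_index = 0
results[chunk_index] = []
for cycle in generator:
    results[chunk_index].append(cycle)
    if len(results[chunk_index]) == 10000000:
        np.savez_compressed('results_' + str(chunk_index), results[chunk_index])
        del results[chunk_index]  # drop the full chunk from RAM
        chunk_index += 1
        results[chunk_index] = []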

Also, to store the arrays I use numpy.savez_compressed, which gives me a 43 MB file for each 10-million-cycle array. Uncompressed, it creates a 500 MB file instead, but using the compressed version slows down the writing process. Any idea how to speed up the writing and/or compressing process?
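To put numbers on the write-speed trade-off, I would time both variants on dummy data along these lines (a hypothetical micro-benchmark, not measurements from my real cycle arrays):

import time
import numpy as np

dummy = np.random.randint(0, 12, size=(1000000, 4))  # stand-in for a chunk of cycles

start = time.time()
np.save('uncompressed_test', dummy)  # plain .npy: large file, fast write
print('np.save: %.2f s' % (time.time() - start))

start = time.time()
np.savez_compressed('compressed_test', dummy)  # .npz: small file, CPU-bound write
print('np.savez_compressed: %.2f s' % (time.time() - start))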

A simplified version of the code I wrote is as follows:

import numpy as np

nbr_result_arrays = 0
result_array_0 = []
result_length = 10000000
tmp = result_array_0  # tmp points at the current result array, so exec only runs when a chunk is full (exec slows down code execution)
for cycle in generator:
    tmp.append(cycle)
    if len(tmp) == result_length:
        exec 'np.savez_compressed(\'results_' + str(nbr_result_arrays) + '\', tmp)'
        exec 'del result_array_' + str(nbr_result_arrays)
        nbr_result_arrays += 1
        exec 'result_array_' + str(nbr_result_arrays) + ' = []'
        exec 'tmp = result_array_' + str(nbr_result_arrays)

Thanks for reading,

Aleix

Upvotes: 0

Views: 1367

Answers (2)

Aleix

Reputation: 63

Thanks to all for your suggestions.

As suggested by @Aya, I believe that to improve performance (and possibly avoid space issues) I should not store the results on the HD, because storing them adds about half as much time again as creating the results, so loading and processing them again would get very close to just creating them again. Additionally, if I do not store any results, I save space, which can become a big issue for bigger digraphs (a 12-node complete digraph has about 115 million cycles, but a 29-node one has about 848E27 cycles... and the count grows at a factorial rate).

The idea is that I first need to iterate through all cycles going through the weakest arc in order to find the total probability of all cycles going through it. Then, with this total probability, I must go through all those cycles again to subtract them from the original array according to their weighted probability (I needed the total probability first to be able to calculate the weighted probability: weighted_prob = prob_of_this_cycle / total_prob_through_this_edge).
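As a toy illustration of the allocation (made-up numbers, just to show the formula at work):

cycle_probs = [0.2, 0.3]  # probabilities of two cycles through the weakest arc
total_prob = sum(cycle_probs)  # 0.5
arc_weight = 10.0  # weight of the weakest arc to be shared out
allocated = [p / total_prob * arc_weight for p in cycle_probs]
print(allocated)  # [4.0, 6.0] -- the arc weight split proportionally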

Thus, I believe this is the best approach for doing that (but I am open to more discussion! :) ).

However, I have doubts about the processing speed of two sub-functions:

  • 1st: finding whether a sequence contains a specific (smaller) sequence. I am doing that with the function "contains_sequence", which relies on the generator function "window" (as suggested in Is there a Python builtin for determining if an iterable contained a certain sequence?). However, I have been told that doing it with a deque could be up to 33% faster (see the deque-based sketch after this list). Any other ideas?

  • 2nd: I am currently finding the probability of a cycle by sliding through the cycle's nodes (the cycle is represented as a list), finding the probability at the output of each arc of staying within the cycle, and then multiplying them all to get the cycle probability (the function is find_cycle_probability; a commented vectorized sketch follows it in the code below). Any performance suggestions on this function would be appreciated, since I need to run it for each cycle, i.e. countless times.
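
For reference, here is my understanding of the deque-based variant that was suggested (a sketch I have not benchmarked myself; the deque's maxlen makes the oldest element fall off automatically as the window slides):

from collections import deque
from itertools import islice

def contains_sequence_deque(all_values, seq):
    seq = list(seq)
    n = len(seq)
    it = iter(all_values)
    window = deque(islice(it, n), maxlen=n)  # first n elements
    if list(window) == seq:
        return True
    for elem in it:
        window.append(elem)  # the leftmost element is dropped automatically
        if list(window) == seq:
            return True
    return False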

Any other tips/suggestions/comments will be most welcome! And thanks again for your help.

Aleix

Below follows the simplified code:

def simple_cycles_generator_w_filters(working_array_digraph, arc):
    '''Generator function generating all cycles containing a specific arc.'''
    generator=new_cycles.simple_cycles_generator(working_array_digraph)
    for cycle in generator:
        if contains_sequence(cycle, arc):             
            yield cycle
    return

def find_smallest_arc_with_cycle(working_array,working_array_digraph):
    '''Find the smallest arc through which at least one cycle flows.
    Returns:
        - if such arc exist:
            smallest_arc_with_cycle = [a,b] where a is the start of arc and b the end
            smallest_arc_with_cycle_value = x where x is the weight of the arc
        - if such arc does not exist:
            smallest_arc_with_cycle = []
            smallest_arc_with_cycle_value = 0 '''
    smallest_arc_with_cycle = []
    smallest_arc_with_cycle_value = 0
    sparse_array = []
    for i in range(numpy.shape(working_array)[0]):
        for j in range(numpy.shape(working_array)[1]):
            if working_array[i][j] !=0:
                sparse_array.append([i,j,working_array[i][j]])
    sorted_array=sorted(sparse_array, key=lambda x: x[2])
    for i in range(len(sorted_array)):
        smallest_arc=[sorted_array[i][0],sorted_array[i][1]]
        generator=simple_cycles_generator_w_filters(working_array_digraph,smallest_arc)
        if any(generator):
            smallest_arc_with_cycle=smallest_arc
            smallest_arc_with_cycle_value=sorted_array[i][2]
            break

    return smallest_arc_with_cycle,smallest_arc_with_cycle_value

def window(seq, n=2):
    """Returns a sliding window (of width n) over data from the iterable
    s -> (s0,s1,...s[n-1]), (s1,s2,...,sn), ... """
    it = iter(seq)
    result = list(itertools.islice(it, n))
    if len(result) == n:
        yield result    
    for elem in it:
        result = result[1:] + [elem]
        yield result

def contains_sequence(all_values, seq):
    return any(seq == current_seq for current_seq in window(all_values, len(seq)))


def find_cycle_probability(cycle, working_array, total_outputs):
    '''Finds the cycle probability of a given cycle within a given array'''
    output_prob_of_each_arc=[]
    for i in range(len(cycle)-1):
        weight_of_the_arc=working_array[cycle[i]][cycle[i+1]]
        output_probability_of_the_arc=float(weight_of_the_arc)/float(total_outputs[cycle[i]])#NOTE:total_outputs is an array, thus the float
        output_prob_of_each_arc.append(output_probability_of_the_arc)
    circuit_probabilities_of_the_cycle=numpy.prod(output_prob_of_each_arc)    
    return circuit_probabilities_of_the_cycle 
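
# A possible vectorized alternative (an untested sketch, not what I currently
# run): numpy fancy indexing fetches all the arc weights in one shot instead
# of looping over the nodes in Python.
def find_cycle_probability_vectorized(cycle, working_array, total_outputs):
    nodes = numpy.asarray(cycle)
    starts, ends = nodes[:-1], nodes[1:]
    return numpy.prod(working_array[starts, ends] / total_outputs[starts])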

def clean_negligible_values(working_array):
    ''' Cleans the array by rounding negligible values to 0 according to a 
    pre-defined threshold.'''
    zero_threshold = 0.000001
    for i in range(numpy.shape(working_array)[0]):
        for j in range(numpy.shape(working_array)[1]):
            if working_array[i][j] == 0:
                continue
            elif 0 < working_array[i][j] < zero_threshold:
                working_array[i][j] = 0
            elif -zero_threshold <= working_array[i][j] < 0:
                working_array[i][j] = 0
            elif working_array[i][j] < -zero_threshold:
                sys.exit('Error')
    return working_array

original_array= 1000 * numpy.random.random_sample((5, 5))
total_outputs=numpy.sum(original_array,axis=0) + 100 * numpy.random.random_sample(5)

working_array = original_array.copy()
straight_array = working_array.copy()
cycle_array=numpy.zeros(numpy.shape(working_array))
iteration_counter=0
working_array_digraph=networkx.DiGraph(working_array)

[smallest_arc_with_cycle, smallest_arc_with_cycle_value] = find_smallest_arc_with_cycle(working_array, working_array_digraph)

while smallest_arc_with_cycle: # using implicit true value of a non-empty list

    cycle_flows_to_be_subtracted = numpy.zeros(numpy.shape(working_array))

    # FIRST run of the generator to calculate each cycle probability
    # note: the cycle generator ONLY provides all cycles going through 
    # the specified weakest arc    
    generator = simple_cycles_generator_w_filters(working_array_digraph, smallest_arc_with_cycle)
    nexus_total_probs = 0
    for cycle in generator:
        cycle_prob = find_cycle_probability(cycle, working_array, total_outputs)
        nexus_total_probs += cycle_prob

    # SECOND run of the generator:
    # using the nexus_total_probs calculated before, I can allocate the weight
    # of the weakest arc to each cycle going through it
    generator = simple_cycles_generator_w_filters(working_array_digraph,smallest_arc_with_cycle)
    for cycle in generator:
        cycle_prob = find_cycle_probability(cycle, working_array, total_outputs)        
        allocated_cycle_weight = cycle_prob / nexus_total_probs * smallest_arc_with_cycle_value
        # create the array to be subtracted
        for i in range(len(cycle)-1):
            cycle_flows_to_be_subtracted[cycle[i]][cycle[i+1]] += allocated_cycle_weight 

    working_array = working_array - cycle_flows_to_be_subtracted
    clean_negligible_values(working_array)    
    cycle_array = cycle_array + cycle_flows_to_be_subtracted   
    straight_array = straight_array - cycle_flows_to_be_subtracted
    clean_negligible_values(straight_array)
    # find the next weakest arc with cycles.
    working_array_digraph=networkx.DiGraph(working_array)
    [smallest_arc_with_cycle, smallest_arc_with_cycle_value] = find_smallest_arc_with_cycle(working_array,working_array_digraph)

Upvotes: 0

falsetru

Reputation: 369274

How about using itertools.islice?

import itertools
import numpy as np

for i in itertools.count():                            # i = 0, 1, 2, ...
    tmp = list(itertools.islice(generator, 10000000))  # take up to 10 million cycles
    if not tmp:                                        # generator exhausted
        break
    np.savez_compressed('results_{}'.format(i), tmp)   # -> results_0.npz, results_1.npz, ...
    del tmp                                            # free the chunk before the next one
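
For later processing, each chunk can be loaded back with numpy.load; an array passed positionally to savez_compressed is stored under the default key arr_0 (a small usage sketch):

import numpy as np

data = np.load('results_0.npz')
chunk = data['arr_0']  # the cycles saved in the first chunk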

Upvotes: 1
