Jared Andrews

Reputation: 272

Python Printing Incredibly Slowly

I'm running into a problem I've never encountered before, and it's frustrating the hell out of me. I'm using rpy2 to interface with R from within a Python script and normalize an array. For some reason, when I go to piece my output together and print it to a file, it takes ages. It also slows down as it proceeds, until it's dripping maybe a few KB of data to the output per minute.

My input file is large (366 MB), but this is running on a high-performance computing cluster with near-unlimited resources. It should have no problem slamming through this.

Here's where I'm actually doing the normalization:

import numpy as np
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
preprocessCore = importr("preprocessCore")  # Bioconductor package that does the quantile normalization

matrix = sample_list  # two-dimensional array: one inner list per column
v = robjects.FloatVector([element for col in matrix for element in col])
m = robjects.r['matrix'](v, ncol=len(matrix), byrow=False)
print("Performing quantile normalization.")
Rnormalized_matrix = preprocessCore.normalize_quantiles(m)
normalized_matrix = np.array(Rnormalized_matrix)

As you can see, I end up with a numpy.array object containing my now-normalized data. I also have another list of strings I want to print to the output, each element corresponding to a row of the numpy array. So I iterate through, joining each row of the array into a string and printing both to the output.

for thing in pos_list:  # List of strings corresponding with each row of array.
    thing_index = pos_list.index(thing)

    norm_data = normalized_matrix[thing_index]
    out_data = "\t".join("{0:.2f}".format(piece) for piece in norm_data)

    print(thing + "\t" + out_data, file=output)

I'm no pro, but I have no idea why things are slowing down so much. Any insight or suggestions would be very, very appreciated. I can post more/the rest of the script if anyone thinks it may be helpful.

Update: Thanks to @lgautier for his profiling suggestion. Using the line_profiler module, I was able to pinpoint the issue to this line:

thing_index = pos_list.index(thing)

This makes sense: .index() scans the list from the start on every iteration, the list is very long, and that also explains why the script slows down as it proceeds. Simply using a counter instead fixed the issue.
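For reference, the fixed loop looks roughly like this (reconstructed from the profile below; enumerate() would give the same index a bit more idiomatically):

count = 0
for thing in pos_list:
    thing_index = count
    count += 1

    norm_data = normalized_matrix[thing_index]
    out_data = "\t".join("{0:.2f}".format(piece) for piece in norm_data)

    print(thing + "\t" + out_data, file=output)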

Profiling of the original code (note the % Time for the offending line):

   Line #      Hits         Time  Per Hit   % Time  Line Contents
   115         1     16445761 16445761.0     15.5       header, pos_list, normalized_matrix = Quantile_Normalize(input_file, data_start)
   117         1           54     54.0      0.0         print("Creating output file...")
   120         1         1450   1450.0      0.0         output = open(output_file, "w")
   122         1            8      8.0      0.0         print(header, file=output)
   124                                                  # Iterate through each position and print QN'd data
   125    100000        74600      0.7      0.1         for thing in pos_list:
   126     99999     85244758    852.5     80.3                 thing_index = pos_list.index(thing)
   129     99999       158741      1.6      0.1                 norm_data = normalized_matrix[thing_index]
   130     99999      3801631     38.0      3.6                 out_data = "\t".join("{0:.2f}".format(piece) for piece in norm_data)
   132     99999       384248      3.8      0.4                 print(thing + "\t" + out_data, file=output)
   134         1         3641   3641.0      0.0         output.close()

Profiling of the new code:

   Line #      Hits         Time  Per Hit   % Time  Line Contents
   115         1     16177130 16177130.0     82.5       header, pos_list, normalized_matrix = Quantile_Normalize(input_file, data_start)
   116
   117         1           55     55.0      0.0         print("Creating output file...")
   118
   119
   120         1        26157  26157.0      0.1         output = open(output_file, "w")
   121
   122         1           11     11.0      0.0         print(header, file=output)
   123
   124                                                  # Iterate through each position and print QN'd data
   125         1            1      1.0      0.0         count = 0
   126    100000        62709      0.6      0.3         for thing in pos_list:
   127     99999        58587      0.6      0.3                 thing_index = count
   128     99999        67164      0.7      0.3                 count += 1
   131     99999        85664      0.9      0.4                 norm_data = normalized_matrix[thing_index]
   132     99999      2877634     28.8     14.7                 out_data = "\t".join("{0:.2f}".format(piece) for piece in norm_data)
   134     99999       240654      2.4      1.2                 print(thing + "\t" + out_data, file=output)
   136         1         1713   1713.0      0.0         output.close()

Upvotes: 1

Views: 356

Answers (2)

lgautier

Reputation: 11545

If I am understanding this correctly, everything is running fine with good performance up to (and including) the line:

normalized_matrix = np.array(Rnormalized_matrix)

At that line the resulting matrix is turned into a numpy array (literally; it can be even faster if you avoid copying the data, as described in http://rpy2.readthedocs.io/en/version_2.8.x/numpy.html?from-rpy2-to-numpy ).
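For example (a minimal sketch; per that page, numpy.asarray should build a view on the R vector's memory instead of copying it):

import numpy as np

# np.array() copies the data out of R; np.asarray() should give a view
# on the same memory (see the linked rpy2 docs), skipping the copy.
normalized_matrix = np.asarray(Rnormalized_matrix)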

I cannot see any rpy2-related performance issue in the rest of the script.

Now, what might be happening is that just because it says "HPC" on the label does not mean it is high-performance in any and every situation with all code. Have you considered running that slow last loop through a code profiler? It would tell you where the time is spent.
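For example, a minimal sketch with the line_profiler package, assuming the slow loop is wrapped in a (hypothetical) function:

# write_output.py -- hypothetical wrapper around the slow loop
@profile  # 'profile' is injected into builtins by kernprof; no import needed
def write_output(pos_list, normalized_matrix, output):
    for thing in pos_list:
        thing_index = pos_list.index(thing)
        norm_data = normalized_matrix[thing_index]
        out_data = "\t".join("{0:.2f}".format(piece) for piece in norm_data)
        print(thing + "\t" + out_data, file=output)

Run it with:

kernprof -l -v write_output.py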

Upvotes: 2

Valéry

Reputation: 4704

For one thing, I usually use a generator expression to avoid building a temporary list of many tiny strings:

out_data = "\t".join("{0:.2f}".format(piece) for piece in norm_data)

But it's hard to tell if this part was the slow one.
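For contrast, the list-comprehension version builds a full list of the formatted strings before joining (a trivial sketch, not necessarily the exact code in question):

# Materializes an intermediate list of formatted strings, then joins it:
out_data = "\t".join(["{0:.2f}".format(piece) for piece in norm_data])

# Generator expression: no explicit intermediate list in the calling code:
out_data = "\t".join("{0:.2f}".format(piece) for piece in norm_data)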

Upvotes: 1
