Reputation: 272
I'm running into a problem I've never encountered before, and it's frustrating the hell out of me. I'm using rpy2 to interface with R from within a Python script and normalize an array. For some reason, when I go to piece my output together and print it to a file, it takes ages to print. It also slows down as it proceeds, until it's dripping maybe a few KB of data to output per minute.
My input file is large (366 MB), but this is running on a high-performance computing cluster with near-unlimited resources. It should have no problem slamming through this.
Here's where I'm actually doing the normalization:
import numpy as np
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr

preprocessCore = importr("preprocessCore")  # R package providing normalize.quantiles

matrix = sample_list  # two-dimensional array
v = robjects.FloatVector([element for col in matrix for element in col])
m = robjects.r['matrix'](v, ncol=len(matrix), byrow=False)
print("Performing quantile normalization.")
Rnormalized_matrix = preprocessCore.normalize_quantiles(m)
normalized_matrix = np.array(Rnormalized_matrix)
As you can see, I end up with a numpy.array object containing my now-normalized data. I have another list containing other strings I want to print to the output as well, each element corresponding to a row of the numpy array. So I iterate through, joining each row of the array into a string, and print both to the output.
for thing in pos_list:  # list of strings corresponding with each row of the array
    thing_index = pos_list.index(thing)
    norm_data = normalized_matrix[thing_index]
    out_data = "\t".join("{0:.2f}".format(piece) for piece in norm_data)
    print(thing + "\t" + out_data, file=output)
I'm no pro, but I have no idea why things are slowing down so much. Any insight or suggestions would be very, very much appreciated. I can post more/the rest of the script if anyone thinks it may be helpful.
Update:
Thanks to @lgautier for his profiling suggestion. Using the line_profiler module, I was able to pinpoint my issue to this line:
thing_index = pos_list.index(thing)
This makes sense: the list is very long, and list.index is a linear scan from the front, so each lookup gets more expensive the deeper into the list the element sits. That also explains the slowdown as the script proceeds. Simply using a count instead fixed the issue.
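For anyone hitting the same wall, the fix is just to carry the index along instead of searching for it. A minimal sketch of the repaired loop using Python's built-in enumerate (the profiled version below uses an explicit count variable, which is equivalent):
# Each row of normalized_matrix lines up with the entry of pos_list at the
# same index, so walk them in lockstep instead of calling list.index().
for thing_index, thing in enumerate(pos_list):
    norm_data = normalized_matrix[thing_index]
    out_data = "\t".join("{0:.2f}".format(piece) for piece in norm_data)
    print(thing + "\t" + out_data, file=output)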
Profiling of the original code (note the % Time column for the offending line):
Line # Hits Time Per Hit % Time Line Contents
115 1 16445761 16445761.0 15.5 header, pos_list, normalized_matrix = Quantile_Normalize(input_file, data_start)
117 1 54 54.0 0.0 print("Creating output file...")
120 1 1450 1450.0 0.0 output = open(output_file, "w")
122 1 8 8.0 0.0 print(header, file=output)
124 # Iterate through each position and print QN'd data
125 100000 74600 0.7 0.1 for thing in pos_list:
126 99999 85244758 852.5 80.3 thing_index = pos_list.index(thing)
129 99999 158741 1.6 0.1 norm_data = normalized_matrix[thing_index]
130 99999 3801631 38.0 3.6 out_data = "\t".join("{0:.2f}".format(piece) for piece in norm_data)
132 99999 384248 3.8 0.4 print(thing + "\t" + out_data, file=output)
134 1 3641 3641.0 0.0 output.close()
Profiling of the new code:
Line # Hits Time Per Hit % Time Line Contents
115 1 16177130 16177130.0 82.5 header, pos_list, normalized_matrix = Quantile_Normalize(input_file, data_start)
116
117 1 55 55.0 0.0 print("Creating output file...")
118
119
120 1 26157 26157.0 0.1 output = open(output_file, "w")
121
122 1 11 11.0 0.0 print(header, file=output)
123
124 # Iterate through each position and print QN'd data
125 1 1 1.0 0.0 count = 0
126 100000 62709 0.6 0.3 for thing in pos_list:
127 99999 58587 0.6 0.3 thing_index = count
128 99999 67164 0.7 0.3 count += 1
131 99999 85664 0.9 0.4 norm_data = normalized_matrix[thing_index]
132 99999 2877634 28.8 14.7 out_data = "\t".join("{0:.2f}".format(piece) for piece in norm_data)
134 99999 240654 2.4 1.2 print(thing + "\t" + out_data, file=output)
136 1 1713 1713.0 0.0 output.close()
Upvotes: 1
Views: 356
Reputation: 11545
If I am understanding this correctly, everything is running fine and with good performance up to (and including) the line:
normalized_matrix = np.array(Rnormalized_matrix)
At that line the resulting matrix is turned into a numpy array (literally; it can even be faster if you avoid copying the data, as shown in http://rpy2.readthedocs.io/en/version_2.8.x/numpy.html?from-rpy2-to-numpy ).
I cannot see any performance issue related to rpy2 in the rest of the script.
Now, just because it says "HPC" on the label does not mean it is high-performance in any and every situation, with all code. Did you consider running that slow last loop through a code profiler? It would tell you where the time is spent.
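A minimal sketch of the no-copy route the linked docs describe (assuming rpy2 2.8.x, where R vectors expose numpy's array interface):
import numpy as np

# np.array() copies the data out of R's memory into a fresh numpy array;
# np.asarray() can instead build a view on the memory R already allocated.
normalized_matrix = np.asarray(Rnormalized_matrix)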
Upvotes: 2
Reputation: 4704
For one thing, I usually use a generator to avoid the temporary list of many tiny strings.
out_data = "\t".join("{0:.2f}".format(piece) for piece in norm_data)
But it's hard to tell whether this part was the one slowing things down.
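To make the contrast concrete (both lines below reuse the names from the question and produce the same output string):
# with a list comprehension, every formatted string is collected in a list first
out_data = "\t".join(["{0:.2f}".format(piece) for piece in norm_data])
# with a generator expression, values are produced lazily as join consumes them
out_data = "\t".join("{0:.2f}".format(piece) for piece in norm_data)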
Upvotes: 1