Bas Jansen

Reputation: 3343

Transforming float values using a function is a performance bottleneck

I have a piece of software that reads a file and transforms the first value on each line using a function (derived from the numpy.polyfit and numpy.poly1d functions).

The software then has to write the transformed data out to a new file, and I wrongly (it seems) assumed that the disk I/O part was the performance bottleneck.

The reason I claim that it is the transformation that is slowing things down is that when I tested the code (listed below) with transformedValue = f(float(values[0])) changed to transformedValue = 1000.00, the time required dropped from 1 minute to 10 seconds.
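
The same per-call overhead can be reproduced in isolation; a minimal timing sketch, with a made-up fit standing in for the real f:

    import timeit
    import numpy

    # Made-up calibration fit, standing in for the real f
    f = numpy.poly1d(numpy.polyfit([1.0, 2.0, 3.0], [1.1, 2.1, 3.3], 2))
    xs = ["123.45"] * 100000

    # One f() call per value, as in the loop below
    print(timeit.timeit(lambda: [f(float(x)) for x in xs], number=1))
    # The same loop with the f() call replaced by a constant
    print(timeit.timeit(lambda: [1000.00 for x in xs], number=1))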

I was wondering if anyone knows of a more efficient way to perform repeated transformations like this?

Code snippet:

def transformFile(self, f):
    """f contains the function returned by numpy.poly1d,
    inputFile is a tab separated file containing two floats
    per line.
    """
    outputBatch = []
    with open(self.inputFile, 'r') as fr:
        for line in fr:
            line = line.rstrip('\n')
            values = line.split()
            transformedValue = f(float(values[0]))   # <-------- Bottleneck
            outputBatch.append(str(transformedValue) + " " + values[1] + "\n")
    joinedOutput = ''.join(outputBatch)
    with open(output, 'w') as fw:   # 'output' holds the output path (defined elsewhere)
        fw.write(joinedOutput)

The function f is generated by another function, which fits a 2nd degree polynomial through a set of measured floats and a set of expected floats. A snippet from that function:

    # Perform 2nd degree polynomial fit
    z = numpy.polyfit(measuredValues, expectedValues, 2)
    f = numpy.poly1d(z)
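
Note that the object returned by numpy.poly1d accepts NumPy arrays as well as scalars, which is what makes the vectorised rewrite below possible. A minimal sketch, with made-up fit data:

    import numpy

    # Made-up calibration data, standing in for the real measured/expected floats
    measuredValues = [1.0, 2.0, 3.0, 4.0]
    expectedValues = [1.1, 2.1, 3.3, 4.2]

    z = numpy.polyfit(measuredValues, expectedValues, 2)
    f = numpy.poly1d(z)

    print(f(2.5))                       # one scalar at a time (slow in a loop)
    print(f(numpy.array([1.5, 2.5])))   # a whole array in one call (fast)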

-- ANSWER --

I have revised the code to vectorize the values prior to transforming them, which significantly sped up performance. The code is now as follows:

def transformFile(self, f):
    """f contains the function returned by numpy.poly1d,
    inputFile is a tab separated file containing two floats
    per line.
    """
    x_values = []
    y_values = []
    with open(self.inputFile, 'r') as fr:
        for line in fr:
            line = line.rstrip('\n')
            values = line.split()
            x_values.append(float(values[0]))
            y_values.append(int(values[1]))
    # Transform the Python list into a numpy array
    xArray = numpy.array(x_values)
    # Apply the polynomial to the whole array in one call
    newArray = f(xArray)
    # Prepare the output as a list of lines
    outputBatch = []
    for x, y in zip(newArray, y_values):
        outputBatch.append(str(x) + " " + str(y) + "\n")
    # Join the output list elements
    joinedOutput = ''.join(outputBatch)
    with open(output, 'w') as fw:   # 'output' holds the output path (defined elsewhere)
        fw.write(joinedOutput)
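
For reference, the method can be shortened further by letting NumPy do the file I/O as well. A sketch, assuming the file fits in memory; the inputFile/outputFile parameters here are hypothetical stand-ins for self.inputFile and output:

    import numpy

    def transformFile(inputFile, outputFile, f):
        # Read both columns in one call; loadtxt splits on whitespace by default
        data = numpy.loadtxt(inputFile)
        # Apply the polynomial to the whole first column at once
        data[:, 0] = f(data[:, 0])
        # Write "x y" per line; '%d' renders the second column as integers again
        numpy.savetxt(outputFile, data, fmt=['%s', '%d'])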

Upvotes: 0

Views: 232

Answers (1)

Alex Riley

Reputation: 176850

It's difficult to suggest improvements without knowing exactly what your function f is doing. Are you able to share it?

However, in general NumPy operations work best (read: "fastest") on NumPy array objects, rather than when they are repeated many times on individual values.

You might like to consider reading the values[0] numbers into a Python list, converting this to a NumPy array, and using vectorised NumPy operations to obtain an array of output values.
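
A minimal sketch of that idea, with a made-up polynomial and made-up input lines:

    import numpy

    # Hypothetical stand-ins for the real poly1d function and file contents
    f = numpy.poly1d([1.0, 2.0, 3.0])          # 1.0*x**2 + 2.0*x + 3.0
    raw_lines = ["0.5\t10", "1.5\t20", "2.5\t30"]

    # Read the first value of each line into a plain Python list
    x_values = [float(line.split()[0]) for line in raw_lines]

    # One vectorised call on the array instead of one f() call per value
    transformed = f(numpy.array(x_values))
    print(transformed)                          # [ 4.25  8.25 14.25]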

Upvotes: 2
