Reputation: 103
I have a very large .dat file that I am loading into a list or numpy array. The file contains columns of values, with the last column being a multiplier for the number of times that value should be loaded into the array. Thus, when loading it, a second sub-loop needs to be run to take that into account. This was done to make the total file size smaller, as the files were getting well over 1 GB in some cases.
I did some testing with numpy pre-allocated zeros arrays vs. appending to a simple list and found that appending was faster, although according to numerous postings this should not be the case. Where is the bottleneck?
if True:
    startTime1 = time.clock()
    ## Numpy way to load up the data
    dataAry = np.zeros(TotalSamples)
    multiColAry = np.loadtxt(filename, skiprows=2, usecols=(columNum, lastColLoc))
    latestLoc = 0
    for i in range(0, multiColAry.shape[0]):
        curData = multiColAry[i][0]
        # only two columns were loaded, so the repeat count is at index 1
        timesBeenHere = int(multiColAry[i][1])
        for j in range(0, timesBeenHere):
            dataAry[latestLoc] = curData
            latestLoc += 1
    endTime1 = time.clock()
    totalTime = (endTime1 - startTime1)  # in seconds
    totalTimeString = timeString(totalTime)
if True:
    # Old string parsing directly version
    startTime1 = time.clock()
    totalAccepted = 0
    f = open(filename, 'r')
    dataAry = []
    line = 'asdf'
    while line != '':
        line = f.readline()
        if line != '':
            dataLineCols = line.split()
            dataValue = float(dataLineCols[columNum])
            timesBeenHere = float(dataLineCols[-1])
            for j in range(0, int(timesBeenHere)):
                totalAccepted += 1
                dataAry.append(dataValue)
    f.close()
    endTime1 = time.clock()
    totalTime = (endTime1 - startTime1)  # in seconds
    totalTimeString = timeString(totalTime)
Thanks for any comments/suggestions.
Upvotes: 1
Views: 1262
Reputation: 25823
I believe you can replace your for loop with numpy.repeat(multiColAry[:, 0], multiColAry[:, 1].astype(int)); that should make a pretty big difference. (The counts must be an integer array for numpy.repeat, and with only two columns loaded via usecols they sit at index 1 of the loaded array.)
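As a minimal sketch of that suggestion (reusing filename, columNum and lastColLoc from the question):

import numpy as np

# load only the two interesting columns, as in the question
multiColAry = np.loadtxt(filename, skiprows=2, usecols=(columNum, lastColLoc))
values = multiColAry[:, 0]              # the data values
counts = multiColAry[:, 1].astype(int)  # repeat counts; np.repeat needs integers
dataAry = np.repeat(values, counts)     # each value repeated counts[i] times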
Also, numpy arrays are generally indexed array[i, j, k] instead of array[i][j][k]. In this case the results should be the same, but in some cases the latter will actually give you the wrong result. In either case, the former should be faster.
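For example, with a hypothetical 2-D array the two spellings can pick out different things entirely:

import numpy as np

a = np.arange(12).reshape(3, 4)
col = a[:, 1]   # second column: array([1, 5, 9])
row = a[:][1]   # a[:] is just the whole array, so this is the second row: array([4, 5, 6, 7])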
Lastly, element-wise operations and for-loops are discouraged when programming with numpy. Instead, array-wise, or "vectorized", code is encouraged. In this paradigm you express the program as operations on whole arrays instead of operations on their elements. Numpy is optimized for this kind of programming. I know this is unfamiliar for people coming over from lower-level languages like C or Java, but it's similar to other scientific programming languages like Matlab or IDL.
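As a small illustration of the paradigm (hypothetical numbers, nothing from the question):

import numpy as np

x = np.random.rand(1000000)

# element-wise: a Python-level loop over every element
total = 0.0
for v in x:
    total += v * v

# array-wise: one vectorized call that runs in optimized C
total_vec = np.dot(x, x)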
Upvotes: 1
Reputation: 15887
In allocating a large zeroed array you need to clear a lot of memory, so if you allocate a large enough array with np.zeros, it may well start paging and will certainly flush your processor cache within that call alone. Allocating the array with ndarray(shape=(TotalSamples,)) instead does not initialize it.
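For instance, np.empty is the usual spelling for such an uninitialized allocation:

import numpy as np

# TotalSamples as in the question
dataAry = np.empty(TotalSamples)             # allocated but not zeroed
# equivalent lower-level form mentioned above:
dataAry2 = np.ndarray(shape=(TotalSamples,))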
Secondly, the first version of the code keeps data that the second one discards on the fly. The input file is clearly a text table of numbers; the first implementation loads its columns up front, while the second reads columns columNum and -1 from each line. Everything loadtxt returns stays in memory for as long as multiColAry exists, whereas the second version discards each line as it moves on to the next. You can keep that footprint down by loading only the columns you need, e.g. loadtxt(filename, usecols=(0, 2)).
Incidentally, did you know that files are iterable? Your tricky combination of f.readline() and empty-string tests can be replaced with for line in f: (and strings are falsy when empty, so you don't even need to compare against the empty string).
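A sketch of the same read loop in that style (same variables as the question; header handling omitted for brevity):

dataAry = []
with open(filename) as f:
    for line in f:                   # iterate the file line by line
        if not line.strip():         # skip blank lines; empty strings are falsy
            continue
        dataLineCols = line.split()
        dataValue = float(dataLineCols[columNum])
        timesBeenHere = int(float(dataLineCols[-1]))
        dataAry.extend([dataValue] * timesBeenHere)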
Also, while using numpy, it is often a good idea not to write inner loops like for j in range(0, timesBeenHere): in Python. That loop could be restructured as dataAry[latestLoc:latestLoc+timesBeenHere].fill(curData) in the array version, or dataAry.extend(multiEntries) in the list version. Creating multiEntries is a chapter in itself, with possibilities like np.ones(timesBeenHere)*dataValue (traditional, but costly if it doesn't fit in cache) or np.linspace(dataValue, dataValue, timesBeenHere). Also see ndarray.put().
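A sketch of the slice-fill variant, using the question's variable names and the two-column multiColAry from above:

import numpy as np

dataAry = np.empty(TotalSamples)     # uninitialized pre-allocation
latestLoc = 0
for i in range(multiColAry.shape[0]):
    curData = multiColAry[i, 0]
    timesBeenHere = int(multiColAry[i, 1])
    # fill a whole slice at once instead of looping in Python
    dataAry[latestLoc:latestLoc + timesBeenHere].fill(curData)
    latestLoc += timesBeenHere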
... I didn't even notice until now that the second version builds a list, not an array, which brings up the question of how the data is used later. For now, I'll just assume it's converted to a numpy array in unseen code.
In the end, I am guessing something like this would probably be most convenient:
dataAry = np.empty(TotalSamples)
i = 0
for line in open(filename):
    words = line.split()
    repeats = int(words[2])
    value = float(words[0])
    # broadcast the scalar into the next `repeats` slots
    np.copyto(dataAry[i:i+repeats], value)
    i += repeats
There's bound to be some way to use numpy to calculate the indices too, but I don't know whether parsing manually like this or loading the full table (well, the interesting columns) with loadtxt is more costly.
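One rough sketch of letting numpy calculate the indices (assuming the two-column load from the question):

import numpy as np

cols = np.loadtxt(filename, skiprows=2, usecols=(columNum, lastColLoc))
repeats = cols[:, 1].astype(int)
starts = np.concatenate(([0], np.cumsum(repeats)[:-1]))  # start index of each run
dataAry = np.empty(repeats.sum())
for start, count, value in zip(starts, repeats, cols[:, 0]):
    dataAry[start:start + count].fill(value)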
Upvotes: 3