Reputation: 103
I have a very large .dat file that I am loading into a list or numpy array. The file contains columns of values, with the last column being a multiplier for the number of times that value should be loaded into the array. Thus, when loading it, a second sub-loop needs to be run to take that into account. This was done to make the total file size smaller, as the files were getting well over 1 GB in some cases.
I did some testing with numpy pre-allocated zeros arrays vs. appending to a simple list and found that appending was faster, although according to numerous postings this should not be the case. Where is the bottleneck?
if True:
    startTime1 = time.clock()
    ## Numpy way to load up the data
    dataAry = np.zeros(TotalSamples)
    multiColAry = np.loadtxt(filename, skiprows=2, usecols=(columNum, lastColLoc))
    latestLoc = 0
    for i in range(0, multiColAry.shape[0]):
        curData = multiColAry[i][0]
        # only two columns were loaded, so the repeat count is at index 1
        timesBeenHere = int(multiColAry[i][1])
        for j in range(0, timesBeenHere):
            dataAry[latestLoc] = curData
            latestLoc += 1
    endTime1 = time.clock()
    totalTime = (endTime1 - startTime1)  # in seconds
    totalTimeString = timeString(totalTime)
if True:
    # Old string parsing directly version
    startTime1 = time.clock()
    totalAccepted = 0
    f = open(filename, 'r')
    dataAry = []
    line = 'asdf'
    while line != '':
        line = f.readline()
        if line != '':
            dataLineCols = line.split()
            dataValue = float(dataLineCols[columNum])
            timesBeenHere = float(dataLineCols[-1])
            for j in range(0, int(timesBeenHere)):
                totalAccepted += 1
                dataAry.append(dataValue)
    f.close()
    endTime1 = time.clock()
    totalTime = (endTime1 - startTime1)  # in seconds
    totalTimeString = timeString(totalTime)
Thanks for any comments/suggestions.
Upvotes: 1
Views: 1262
Reputation: 25823
I believe you can replace your for loop with numpy.repeat(multiColAry[:, 0], multiColAry[:, 1].astype(int)); that should make a pretty big difference. (The counts must be an integer array for numpy.repeat, and with only two columns loaded via usecols they sit at index 1 of the loaded array.)
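As a minimal sketch of that suggestion (reusing filename, columNum and lastColLoc from the question):

import numpy as np

# load only the two interesting columns, as in the question
multiColAry = np.loadtxt(filename, skiprows=2, usecols=(columNum, lastColLoc))
values = multiColAry[:, 0]              # the data values
counts = multiColAry[:, 1].astype(int)  # repeat counts; np.repeat needs integers
dataAry = np.repeat(values, counts)     # each value repeated counts[i] times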
Also, numpy arrays are generally indexed array[i, j, k] instead of array[i][j][k]. In this case the results should be the same, but in some cases the latter will actually give you the wrong result. In either case, the former should be faster.
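For example, with a hypothetical 2-D array the two spellings can pick out different things entirely:

import numpy as np

a = np.arange(12).reshape(3, 4)
col = a[:, 1]   # second column: array([1, 5, 9])
row = a[:][1]   # a[:] is just the whole array, so this is the second row: array([4, 5, 6, 7])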
Lastly, element-wise operations and for-loops are discouraged when programming with numpy. Instead, array-wise, or "vectorized", code is encouraged. In this paradigm you express the program as operations on whole arrays instead of operations on their elements. Numpy is optimized for this kind of programming. I know this is unfamiliar for people coming over from lower-level languages like C or Java, but it's similar to other scientific programming languages like Matlab or IDL.
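As a small illustration of the paradigm (hypothetical numbers, nothing from the question):

import numpy as np

x = np.random.rand(1000000)

# element-wise: a Python-level loop over every element
total = 0.0
for v in x:
    total += v * v

# array-wise: one vectorized call that runs in optimized C
total_vec = np.dot(x, x)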
Upvotes: 1
Reputation: 15887
In allocating a large zeroed array you need to clear a lot of memory, so if you allocate a large enough array with np.zeros, it may well start paging and will certainly flush your processor cache within that call alone. Allocating the array with ndarray(shape=(TotalSamples,)) instead does not initialize it.
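For instance, np.empty is the usual spelling for such an uninitialized allocation:

import numpy as np

# TotalSamples as in the question
dataAry = np.empty(TotalSamples)             # allocated but not zeroed
# equivalent lower-level form mentioned above:
dataAry2 = np.ndarray(shape=(TotalSamples,))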
Secondly, the first version of the code keeps data that the second one discards on the fly. The input file is clearly a text table of numbers; the first implementation loads its columns up front, while the second reads columns columNum and -1 from each line. Everything loadtxt returns stays in memory for as long as multiColAry exists, whereas the second version discards each line as it moves on to the next. You can keep that footprint down by loading only the columns you need, e.g. loadtxt(filename, usecols=(0, 2)).
Incidentally, did you know that files are iterable? Your tricky combination of f.readline() and empty-string tests can be replaced with for line in f: (and strings are falsy when empty, so you don't even need to compare against the empty string).
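A sketch of the same read loop in that style (same variables as the question; header handling omitted for brevity):

dataAry = []
with open(filename) as f:
    for line in f:                   # iterate the file line by line
        if not line.strip():         # skip blank lines; empty strings are falsy
            continue
        dataLineCols = line.split()
        dataValue = float(dataLineCols[columNum])
        timesBeenHere = int(float(dataLineCols[-1]))
        dataAry.extend([dataValue] * timesBeenHere)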
Also, while using numpy, it is often a good idea not to write inner loops like for j in range(0, timesBeenHere): in Python. That loop could be restructured as dataAry[latestLoc:latestLoc+timesBeenHere].fill(curData) in the array version, or dataAry.extend(multiEntries) in the list version. Creating multiEntries is a chapter in itself, with possibilities like np.ones(timesBeenHere)*dataValue (traditional, but costly if it doesn't fit in cache) or np.linspace(dataValue, dataValue, timesBeenHere). Also see ndarray.put().
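A sketch of the slice-fill variant, using the question's variable names and the two-column multiColAry from above:

import numpy as np

dataAry = np.empty(TotalSamples)     # uninitialized pre-allocation
latestLoc = 0
for i in range(multiColAry.shape[0]):
    curData = multiColAry[i, 0]
    timesBeenHere = int(multiColAry[i, 1])
    # fill a whole slice at once instead of looping in Python
    dataAry[latestLoc:latestLoc + timesBeenHere].fill(curData)
    latestLoc += timesBeenHere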
... I didn't even notice until now that the second version builds a list, not an array, which brings up the question of how the data is used later. For now, I'll just assume it's converted to a numpy array in unseen code.
In the end, I am guessing something like this would probably be most convenient:
dataAry = np.empty(TotalSamples)
i = 0
for line in open(filename):
    words = line.split()
    repeats = int(words[2])
    value = float(words[0])
    # broadcast the scalar into the next `repeats` slots
    np.copyto(dataAry[i:i+repeats], value)
    i += repeats
There's bound to be some way to use numpy to calculate the indices too, but I don't know whether parsing manually like this or loading the full table (well, the interesting columns) with loadtxt is more costly.
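One rough sketch of letting numpy calculate the indices (assuming the two-column load from the question):

import numpy as np

cols = np.loadtxt(filename, skiprows=2, usecols=(columNum, lastColLoc))
repeats = cols[:, 1].astype(int)
starts = np.concatenate(([0], np.cumsum(repeats)[:-1]))  # start index of each run
dataAry = np.empty(repeats.sum())
for start, count, value in zip(starts, repeats, cols[:, 0]):
    dataAry[start:start + count].fill(value)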
Upvotes: 3