Reputation: 126
My problem is that I need to read around 50M lines from a file in the format
x1 "\t" x2 "\t" .. x10 "\t" count
and then to compute the matrix A with components A[j][i] = Sum (over all lines) count * x_i * x_j.
I tried two approaches, both reading the file line by line:
1) keep A as a plain Python matrix (list of lists) and update it in a for loop:
for j in range(size):
    for i in range(size):
        A[j][i] += x[j] * x[i] * count
2) make A a numpy array and update it in place using numpy.add:
numpy.add(A, count * numpy.outer(x, x), out=A)
What surprised me is that the second approach was around 30% slower than the first one. And both are really slow - around 10 minutes for the whole file...
Is there some way to speed up the calculation of the matrix? Maybe there is some function that would read the data entirely from the file (or in large chunks) rather than line by line? Any suggestions?
Upvotes: 3
Views: 1004
Reputation: 25033
Your matrix is symmetric, so with your first approach you can compute just the upper half (55 updates per input line instead of 100).
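For illustration, a minimal sketch of that idea, reusing the size, x, count and A names from the question; the lower triangle is filled in once at the end:

# Inside the per-line loop: update only the upper triangle (i >= j).
for j in range(size):
    for i in range(j, size):
        A[j][i] += x[j] * x[i] * count

# After the whole file has been processed, mirror the upper triangle down.
for j in range(size):
    for i in range(j):
        A[j][i] = A[i][j]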
As for why the second approach is slower, I'm not sure, but if you're instantiating 50M small ndarrays (one or two per line), that allocation overhead could well be the bottleneck. Using a single preallocated ndarray and copying each line's data into it,
import numpy as np

x = np.zeros((11,))                           # 10 x-values plus the count, reused for every line
for l in data.readlines():                    # data is the already-opened input file
    x[:] = l.split()                          # parse all 11 tab-separated fields at once
    A += np.outer(x[:-1], x[:-1]) * x[-1]     # rank-1 update weighted by the count
may result in a speedup.
Upvotes: 1
Reputation: 1
Depending on how much memory you have available on your machine, you could try using a regular expression to parse the values and numpy reshaping and slicing to apply the calculations. If you run out of memory, consider a similar approach but read the file in, say, 1M-line chunks (a sketch of that follows the code below).
import re
import numpy as np

txt = open("C:/temp/input.dat").read()
values = re.split(r"[\t\n]", txt.strip())
thefloats = [float(x) for x in values]
mat = np.reshape(thefloats, (num_rows, num_cols)).T   # one column per input line
counts = mat[-1, :]                                   # last row holds the counts
for i in range(len(counts)):
    mat[:-1, i] *= counts[i]                          # scale each line's x-values by its count
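For reference, a rough sketch of the chunked variant: the 1M-line chunk size and file path are placeholders, and the last two lines finish the accumulation of A (which the snippet above leaves implicit) in a vectorized way:

import re
import numpy as np
from itertools import islice

A = np.zeros((10, 10))
with open("C:/temp/input.dat") as f:
    while True:
        chunk = list(islice(f, 1000000))          # read roughly 1M lines at a time
        if not chunk:
            break
        values = re.split(r"[\t\n]", "".join(chunk).strip())
        mat = np.array(values, dtype=float).reshape(-1, 11)   # one row per line: 10 x-values + count
        xs, counts = mat[:, :-1], mat[:, -1]
        A += (xs * counts[:, None]).T @ xs        # accumulate count * x_i * x_j over the chunk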
Upvotes: 0
Reputation: 40973
Some thoughts:
- Use pandas.read_csv with the C engine to read the file. It is a lot faster than np.genfromtxt because the engine is C/Cython optimized.
- Use pandas.read_csv with the iterator and chunksize parameters set, and process one chunk at a time (see the sketch below). I bet there is an optimal chunk size that will beat the other methods.
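A minimal sketch of that chunked pandas approach; the file name and chunk size are placeholders, and the vectorized accumulation of A at the end is an added illustration rather than part of the suggestion above:

import numpy as np
import pandas as pd

A = np.zeros((10, 10))
reader = pd.read_csv("input.dat", sep="\t", header=None,
                     engine="c", chunksize=1000000)
for chunk in reader:
    xs = chunk.iloc[:, :-1].to_numpy()        # the 10 x-values of each line
    counts = chunk.iloc[:, -1].to_numpy()     # the count column
    A += (xs * counts[:, None]).T @ xs        # accumulate count * x_i * x_j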
Upvotes: 2