Andrey Sh

Reputation: 126

Fast matrix update with numpy

My problem is, I need to read around 50M lines from a file in format

x1 "\t" x2 "\t" .. x10 "\t" count

and then to compute the matrix A with components A[j][i] = Sum (over all lines) count * x_i * x_j.
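Each line therefore contributes count times the outer product of its x vector with itself. For illustration (made-up values, and only 3 components instead of 10):

  import numpy

  x = numpy.array([1.0, 2.0, 0.5])           # one line's x values (3 here instead of 10)
  count = 4.0
  contribution = count * numpy.outer(x, x)   # contribution[j][i] == count * x[j] * x[i]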

I tried 2 approaches, both reading the file line by line:

1) keep A as a Python list of lists and update it in a nested for loop:

  for j in range(size):
    for i in range(size):
      A[j][i] += x[j] * x[i] * count

2) make A a numpy array and update it in place with numpy.add:

  numpy.add(A, count * numpy.outer(x, x), out=A)

What surprised me is that the 2nd approach was around 30% slower than the first one. And both are really slow: around 10 minutes for the whole file...

Is there some way to speed up the calculation of the matrix? Maybe there is some function that reads the data from the file entirely (or in large chunks) rather than line by line? Any suggestions?

Upvotes: 3

Views: 1004

Answers (3)

gboffi

Reputation: 25033

Your matrix is symmetric: compute just the upper half using your first approach (55 updates per input line instead of 100).
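A minimal sketch of that idea, reusing the names from the question (size, A, x, count): update only the entries with i >= j while reading, then mirror the triangle once at the end.

# per input line: only the upper triangle, 55 updates instead of 100
for j in range(size):
    for i in range(j, size):
        A[j][i] += x[j] * x[i] * count

# once, after the whole file has been read: copy the upper triangle into the lower one
for j in range(size):
    for i in range(j):
        A[j][i] = A[i][j]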

The second approach is slower. I don't know exactly why, but if you're instantiating 50M small ndarrays, that allocation may well be the bottleneck; using a single preallocated ndarray and copying each line's data into it

x = np.zeros((11,))                      # single reused buffer: 10 x values plus the count
for l in data:                           # iterate lazily instead of readlines()
    x[:] = l.split()                     # numpy converts the split strings to floats in place
    A += np.outer(x[:-1], x[:-1]) * x[-1]

may result in a speedup.

Upvotes: 1

DM__

Reputation: 1

Depending on how much memory you have available on your machine, you could try using a regular expression to parse the values and numpy reshaping and slicing to apply the calculations. If you run out of memory, consider a similar approach, but read the file in, say, 1M-line chunks.

import re
import numpy as np

txt = open("C:/temp/input.dat").read()
values = re.split(r"[\t\n]", txt.strip())

thefloats = [float(x) for x in values]
num_cols = 11                                # 10 x values plus the count per line
mat = np.reshape(thefloats, (-1, num_cols))  # one row per input line

counts = mat[:, -1]                          # the last column holds the counts
mat[:, :-1] *= counts[:, None]               # scale each row's x values by its count
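
From there, the matrix A from the question can be obtained with a single matrix product. A minimal sketch, assuming the rows are still unscaled (i.e. use this in place of the in-place scaling above):

X = mat[:, :-1]                  # the 10 x values of every line
c = mat[:, -1]                   # the count of every line
A = (X * c[:, None]).T.dot(X)    # A[j, i] = sum over lines of count * x[j] * x[i]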

Upvotes: 0

elyase

Reputation: 40973

Some thoughts:

  • Use pandas.read_csv with the C engine to read the file. It is a lot faster than np.genfromtxt because that engine is implemented in C/Cython.
  • You can read the whole file into memory and then do the calculations. This is the easiest way, but from an efficiency perspective your CPU will be mostly idle, waiting for input. That time could be better spent calculating.
  • You can try to read and process line by line (e.g. with the csv module). While I/O will still be the bottleneck, by the end you will have processed your file. The problem here is that you will still lose some efficiency to the Python overhead.
  • Probably the best combination would be to read in chunks, using pandas.read_csv with the iterator and chunksize parameters set, and process one chunk at a time (a sketch follows below). I bet there is an optimal chunk size that will beat the other methods.
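
As a rough illustration of the last point, here is a minimal sketch; the file name, separator and chunk size are assumptions, and the column layout is the one from the question (10 x values followed by the count):

import numpy as np
import pandas as pd

size = 10
A = np.zeros((size, size))

reader = pd.read_csv("input.dat", sep="\t", header=None, chunksize=1000000)
for chunk in reader:
    block = chunk.values              # shape (chunk_len, 11)
    X = block[:, :size]               # the x values of each line in the chunk
    c = block[:, size]                # the count of each line in the chunk
    A += (X * c[:, None]).T.dot(X)    # add this chunk's contribution to A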

Upvotes: 2
