user923543


Efficient calculation on complete columns (pytables, hdf5, numpy)

I have a simple HDF5 file (created by PyTables) with ten columns and 100000 rows. To every value I have to apply a simple linear equation, with different parameters per column, and write the results to CSV.

My naive approach was to loop over the table:

for row in table.iterrows():
    print "%f,%f,..." % (row['a'] * 1.0 + 2.0, row['b'] * 3.0 + 4.0, ...)

But I wonder whether it would be more efficient to select the columns, compute them as whole arrays, and then iterate over the resulting arrays:

a = numpy.add(numpy.multiply(table.cols.a, 1.0), 2.0)
b = numpy.add(numpy.multiply(table.cols.b, 3.0), 4.0)

But this seems to be even slower.
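(For reference, the fully in-memory version of this idea looks like the sketch below, assuming the columns fit in RAM. Synthetic arrays stand in for the PyTables columns; with a real table you would read each column in one call, e.g. `a = table.cols.a[:]`, rather than passing the lazy `Column` object to NumPy.)

```python
import io
import numpy as np

# Synthetic stand-ins for the table columns; with PyTables you would
# read each whole column at once, e.g. a = table.cols.a[:]
a = np.arange(5, dtype=np.float64)
b = np.arange(5, dtype=np.float64)

# Apply the per-column linear equations to whole arrays (vectorized)
a = a * 1.0 + 2.0
b = b * 3.0 + 4.0

# Write all rows to CSV in a single call instead of printing per row
buf = io.StringIO()
np.savetxt(buf, np.column_stack((a, b)), fmt="%f", delimiter=",")
```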

What is the best way to do this?

Upvotes: 0

Views: 215

Answers (1)

Eelco Hoogendoorn

Reputation: 10769

Your performance is likely going to be limited by writing to CSV; but apart from that, this problem is exactly what numexpr was made for.

You could use the Expr.set_output method to write your result back to HDF5 instead of iterating over the result and writing to CSV directly, and then look for a more efficient way to convert that result column to CSV in a single optimized call. Better yet, find a way to do away with the CSV entirely: if performance really is a major concern, it makes little sense as an output format.
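As a rough sketch of the numexpr route (using a plain NumPy array here in place of the HDF5 column, and `numexpr.evaluate` rather than the out-of-core `tables.Expr` variant):

```python
import numpy as np
import numexpr as ne

# Stand-in for one column of the table; with PyTables,
# tables.Expr("a * 1.0 + 2.0") plus set_output(...) would evaluate
# the same expression out-of-core and write it back into the file.
a = np.arange(100000, dtype=np.float64)

# numexpr compiles the expression and evaluates it in a single
# multi-threaded pass over the array, without Python-level looping
result = ne.evaluate("a * 1.0 + 2.0")
```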

Upvotes: 1
