Reputation: 3906
In Python 2.7, I'm trying to speed up serialising some very large matrices into a line-based format (these work out at ~2-5 billion lines when serialised).
The output format is <row> <col> <value>\n, where row and col are ints and value is a float, e.g.:
0 0 0.4
0 1 1.2
...
12521 5498 0.456
12521 5499 0.11
The input data is a scipy.sparse.coo_matrix, and it is currently serialised using the following:
from __future__ import print_function
from __future__ import unicode_literals
import itertools

# ... code to generate 'matrix' variable skipped ...

with open('outfile', 'w') as fh:
    for i, j, v in itertools.izip(matrix.row, matrix.col, matrix.data):
        print(b"{} {} {}".format(i, j, v), file=fh)
Depending on the input matrix, this can take several hours to run, so even decreasing write time by 10% would be a significant time saving.
Upvotes: 1
Views: 55
Reputation: 35125
Pandas seems to be somewhat faster. (Since it apparently ends up copying the data, you may want to apply it to fixed-size blocks to keep memory usage down; see the sketch after the code below.)
df = pandas.DataFrame(dict(row=row, col=col, value=value),
                      columns=['row', 'col', 'value'],
                      copy=False)
df.to_csv('outfile', sep=' ', header=False, index=False)
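A minimal sketch of the fixed-size-block variant mentioned above; the dump_in_blocks name and the 10-million-row block size are illustrative choices, not measured code:

import pandas

def dump_in_blocks(filename, row, col, value, block_size=10000000):
    # Build a DataFrame per block so pandas only copies one
    # block's worth of the input arrays at a time.
    with open(filename, 'w') as fh:
        for start in xrange(0, len(row), block_size):
            stop = start + block_size
            df = pandas.DataFrame(dict(row=row[start:stop],
                                       col=col[start:stop],
                                       value=value[start:stop]),
                                  columns=['row', 'col', 'value'],
                                  copy=False)
            df.to_csv(fh, sep=' ', header=False, index=False)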
A yet faster option is a low-level dumping routine written in Cython:
from libc.stdio cimport fprintf, fopen, FILE, fclose

def dump_array(bytes filename, long[:] row, long[:] col, double[:] value):
    cdef FILE *fh
    cdef Py_ssize_t i, n

    n = row.shape[0]
    fh = fopen(filename, "w")
    if fh == NULL:
        raise RuntimeError("file open failed")
    try:
        with nogil:
            for i in range(n):
                fprintf(fh, "%ld %ld %g\n", row[i], col[i], value[i])
    finally:
        fclose(fh)
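Assuming the routine above is saved as, say, dumper.pyx and compiled (pyximport is one convenient way), calling it would look roughly like this. The module name is hypothetical, and the arrays must be cast to dtypes matching the memoryview declarations (C long and double; np.int64 matches C long on typical 64-bit Linux, but not on all platforms):

import numpy as np
import scipy.sparse
import pyximport
pyximport.install()
import dumper  # hypothetical module name for the Cython code above

# Small random COO matrix just to demonstrate the call.
matrix = scipy.sparse.rand(1000, 1000, density=0.01, format='coo')
dumper.dump_array(b'outfile',
                  matrix.row.astype(np.int64),     # must match long[:]
                  matrix.col.astype(np.int64),
                  matrix.data.astype(np.float64))  # must match double[:]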
Timings:
original: 5.0 s
pandas:   3.1 s
Cython:   0.9 s
Upvotes: 2