Reputation: 3906
In Python 2.7, I'm trying to speed up serialising some very large matrices into a line-based format (these work out at ~2-5 billion lines when serialised).
The output format is <row> <col> <value>\n, where row and col are ints and value is a float, e.g.:
0 0 0.4
0 1 1.2
...
12521 5498 0.456
12521 5499 0.11
The input data is a scipy.sparse.coo_matrix, and it is currently serialised using the following:
from __future__ import print_function
from __future__ import unicode_literals
import itertools

# ... code to generate 'matrix' variable skipped ...

with open('outfile', 'w') as fh:
    for i, j, v in itertools.izip(matrix.row, matrix.col, matrix.data):
        print(b"{} {} {}".format(i, j, v), file=fh)
Depending on the input matrix, this can take several hours to run, so even decreasing write time by 10% would be a significant time saving.
Upvotes: 1
Views: 55
Reputation: 35125
Pandas seems to be somewhat faster. (Since it apparently ends up copying the data, you may want to apply it to fixed-size blocks to keep memory usage down; see the sketch after the code below.)
df = pandas.DataFrame(dict(row=row, col=col, value=value),
                      columns=['row', 'col', 'value'],
                      copy=False)
df.to_csv('outfile', sep=' ', header=False, index=False)
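A minimal sketch of the fixed-size-block variant mentioned above; the dump_in_blocks name and the 10-million-row block size are illustrative choices, not measured code:

import pandas

def dump_in_blocks(filename, row, col, value, block_size=10000000):
    # Build a DataFrame per block so pandas only copies one
    # block's worth of the input arrays at a time.
    with open(filename, 'w') as fh:
        for start in xrange(0, len(row), block_size):
            stop = start + block_size
            df = pandas.DataFrame(dict(row=row[start:stop],
                                       col=col[start:stop],
                                       value=value[start:stop]),
                                  columns=['row', 'col', 'value'],
                                  copy=False)
            df.to_csv(fh, sep=' ', header=False, index=False)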
A yet faster option is a low-level dumping routine written in Cython:
from libc.stdio cimport fprintf, fopen, FILE, fclose

def dump_array(bytes filename, long[:] row, long[:] col, double[:] value):
    cdef FILE *fh
    cdef Py_ssize_t i, n

    n = row.shape[0]
    fh = fopen(filename, "w")
    if fh == NULL:
        raise RuntimeError("file open failed")
    try:
        with nogil:
            for i in range(n):
                fprintf(fh, "%ld %ld %g\n", row[i], col[i], value[i])
    finally:
        fclose(fh)
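Assuming the routine above is saved as, say, dumper.pyx and compiled (pyximport is one convenient way), calling it would look roughly like this. The module name is hypothetical, and the arrays must be cast to dtypes matching the memoryview declarations (C long and double; np.int64 matches C long on typical 64-bit Linux, but not on all platforms):

import numpy as np
import scipy.sparse
import pyximport
pyximport.install()
import dumper  # hypothetical module name for the Cython code above

# Small random COO matrix just to demonstrate the call.
matrix = scipy.sparse.rand(1000, 1000, density=0.01, format='coo')
dumper.dump_array(b'outfile',
                  matrix.row.astype(np.int64),     # must match long[:]
                  matrix.col.astype(np.int64),
                  matrix.data.astype(np.float64))  # must match double[:]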
Timings:
original: 5.0 s
pandas:   3.1 s
Cython:   0.9 s
Upvotes: 2