Josh Milthorpe

Reputation: 1048

SystemML binary format

SystemML comes packaged with a range of scripts that generate random input data files for use by the various algorithms. Each script accepts an option 'format' which determines whether the data files should be written in CSV or binary format.

I've taken a look at the binary files but they're not in any format I recognize. There doesn't appear to be documentation anywhere online. What is the binary format? What fields are in the header? For dense matrices, are the data contiguously packed at the end of the file (IEEE-754 32-bit float), or are there metadata fields spaced throughout the file?

Upvotes: 0

Views: 80

Answers (1)

mboehm7

Reputation: 115

Essentially, our binary formats for matrices and frames are Hadoop sequence files (a single file or a directory of part files) of type <MatrixIndexes,MatrixBlock> (with MatrixIndexes being a long-long pair of row/column block indexes) and <LongWritable,FrameBlock>, respectively. So anybody with the Hadoop I/O libraries and SystemML on the classpath can consume these files.

In detail, this binary block format is our internal tiled matrix representation (with a default block size of 1K x 1K entries, and hence fixed logical but potentially variable physical size). Any external format provided to SystemML, such as CSV or Matrix Market, is automatically converted into binary block format, and all operations work over these binary intermediates. Depending on the backend, there are different representations, though:

  • For single-node, in-memory operations and storage, the entire matrix is represented as a single block in deserialized form (where we use linearized double arrays for dense and MCSR, CSR, or COO for sparse).
  • For Spark operations and storage, a matrix is represented as JavaPairRDD<MatrixIndexes, MatrixBlock> and we use MEMORY_AND_DISK (deserialized) as the default storage level in aggregated memory.
  • For MapReduce operations and storage, matrices are actually persisted to sequence files (similar to inputs/outputs).
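To make the tiling concrete, here is a minimal sketch of how a cell position could map to a block index, and how a dense block could be addressed as a linearized array. It is not SystemML code: the class and method names are hypothetical, "1K" is assumed to mean 1000, block indexes are assumed to be 1-based, boundary blocks are assumed to be truncated, and the dense layout is assumed to be row-major.

```java
// Hypothetical illustration of blocked (tiled) matrix indexing; not SystemML's API.
class BlockTilingSketch {
    // Assumption: the "1K x 1K" default block size means 1000 x 1000 entries.
    static final int BLOCK_SIZE = 1000;

    // Assumption: cell and block indexes are both 1-based.
    static long blockIndex(long cellIndex) {
        return (cellIndex - 1) / BLOCK_SIZE + 1;
    }

    // Number of blocks needed to cover one dimension (ceiling division).
    static long numBlocks(long dim) {
        return (dim + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Length of a given block along one dimension; the last block may be shorter
    // (assumption: boundary blocks are truncated rather than padded).
    static int blockLength(long dim, long blockIdx) {
        return (int) Math.min(BLOCK_SIZE, dim - (blockIdx - 1) * BLOCK_SIZE);
    }

    // Offset of a 0-based (row, col) position inside a dense block stored as a
    // linearized double array (assumption: row-major order).
    static int denseOffset(int row, int col, int cols) {
        return row * cols + col;
    }

    public static void main(String[] args) {
        long rows = 2500, cols = 1200;
        // A 2500 x 1200 matrix is covered by a 3 x 2 grid of blocks.
        System.out.println(numBlocks(rows) + " x " + numBlocks(cols) + " blocks");
        // The last row block holds only the remaining 500 rows.
        System.out.println("rows in last row block: " + blockLength(rows, numBlocks(rows)));
    }
}
```

Under these assumptions, the key of each sequence-file record would be the pair (blockIndex(row), blockIndex(col)), and the value would be the block holding that tile.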

Furthermore, in serialized form (as written to sequence files or during shuffle), matrix blocks are encoded in one of the following formats: (1) empty (header: int rows, int cols, byte type), (2) dense (header plus serialized double values), (3) sparse (header plus for each row: nnz per row, followed by column index, value pairs), (4) ultra-sparse (header plus triples of row/column indexes and values, or pairs of row indexes and values for vectors). Note that we also redirect Java serialization via writeExternal(ObjectOutput os) and readExternal(ObjectInput is) to the same serialization code path.
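As a rough illustration of the dense case described above (header of int rows, int cols, byte type, followed by the double values), the following sketch writes and reads such a layout with plain java.io streams. It is not SystemML's serialization code: the class name, method names, and the type code value are hypothetical, and the real format may carry additional fields. Java's DataOutput writes each double as 8 bytes in big-endian IEEE-754 form, so values in this sketch are 64-bit, not 32-bit floats.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical illustration of a "header plus doubles" dense block layout.
class DenseBlockLayoutSketch {
    // Assumption: the actual type code used by SystemML is not given in the text.
    static final byte TYPE_DENSE = 1;

    // Serializes the header (int rows, int cols, byte type) followed by all values.
    static byte[] writeDense(int rows, int cols, double[] values) {
        if (values.length != rows * cols) {
            throw new IllegalArgumentException("expected rows * cols values");
        }
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(rows);
            out.writeInt(cols);
            out.writeByte(TYPE_DENSE);
            for (double v : values) {
                out.writeDouble(v); // 8 bytes each, big-endian
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the header back, checks the type code, then reads rows * cols doubles.
    static double[] readDense(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int rows = in.readInt();
            int cols = in.readInt();
            byte type = in.readByte();
            if (type != TYPE_DENSE) {
                throw new IllegalArgumentException("not a dense block in this sketch: " + type);
            }
            double[] values = new double[rows * cols];
            for (int i = 0; i < values.length; i++) {
                values[i] = in.readDouble();
            }
            return values;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] encoded = writeDense(2, 2, new double[] {1.0, 2.0, 3.0, 4.0});
        // 9 header bytes (4 + 4 + 1) plus 4 values of 8 bytes each.
        System.out.println("encoded bytes: " + encoded.length);
        System.out.println("last value: " + readDense(encoded)[3]);
    }
}
```

In this sketch the values sit contiguously after a 9-byte header within one block. A whole file would still be a sequence of such blocks wrapped in Hadoop sequence-file records, so the values of a large matrix are not contiguous across the file.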

There are more details, especially with regard to the recently added compressed matrix blocks and frame blocks - so please ask if you're interested in anything specific here.

Upvotes: 1
