Colin T Bowers

Reputation: 18530

What is the fastest method for reading and writing a Matrix{Float64} to file in Julia?

Let x = randn(100, 2). I want to write x to its own file. This file will contain x, and only x, and x will only ever be of type Matrix{Float64}. In the past, I have always used HDF5 for this, but it occurs to me that this is overkill, since in this setup I will only have one array per file. Note that JLD uses HDF5, and so is also overkill.

1) What is the fastest method for reading and writing x assuming I will only ever want to read the entire matrix?

2) What is the fastest method for reading and writing x assuming I might want to read a slice of the matrix?

3) What is the fastest method for reading and writing x assuming I might want to read a slice of the matrix, or over-write a slice of the matrix (but not change the matrix size)?

Upvotes: 1

Views: 1485

Answers (3)

Colin T Bowers

Reputation: 18530

Based on the suggestions made by Tasos above, I put together a rudimentary speed test for both writes and reads using 4 different methods:

  1. h5 (using the HDF5 package)
  2. jld (using the JLD2 package)
  3. slz (using serialize and deserialize)
  4. dat (write to a binary file, using the first 128 bits to store the dimension of the matrix)

I've pasted the test code at the bottom of this answer. The results are:

julia> @time f_write_test(N, "h5")
  0.191555 seconds (2.11 k allocations: 76.380 MiB, 26.39% gc time)

julia> @time f_write_test(N, "jld")
  0.774857 seconds (8.33 k allocations: 77.058 MiB, 0.32% gc time)

julia> @time f_write_test(N, "slz")
  0.108687 seconds (2.61 k allocations: 76.495 MiB, 1.91% gc time)

julia> @time f_write_test(N, "dat")
  0.087488 seconds (1.61 k allocations: 76.379 MiB, 1.08% gc time)

julia> @time f_read_test(N, "h5")
  0.051646 seconds (5.81 k allocations: 76.515 MiB, 14.80% gc time)

julia> @time f_read_test(N, "jld")
  0.071249 seconds (10.04 k allocations: 77.136 MiB, 57.60% gc time)

julia> @time f_read_test(N, "slz")
  0.038967 seconds (3.11 k allocations: 76.527 MiB, 22.17% gc time)

julia> @time f_read_test(N, "dat")
  0.068544 seconds (1.81 k allocations: 76.405 MiB, 59.21% gc time)

So for writes, the write-to-binary option outperforms even serialize, is roughly twice as fast as HDF5, and is almost an order of magnitude faster than JLD2.

For reads, deserialize has the best performance, while HDF5, JLD2 and reading from binary are all fairly close in performance, with HDF5 being slightly ahead.

I haven't included a test for writing to slices, but may come back to this in the future. Obviously writing to slices is impossible using serialize (not to mention the versioning/system image issues that serialize also faces), and I'm not really sure how to do it using JLD2. My gut feeling is that writing a slice to binary will easily beat HDF5 if the slice is contiguous on disk, but will probably be significantly slower than HDF5 if it is non-contiguous and the HDF5 method optimally exploits chunking. If HDF5 doesn't exploit chunking (which requires knowing at write time what slices you will want), then I suspect the binary method will come out ahead.
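As a rough illustration of the contiguous case, here is a minimal sketch of reading and overwriting a block of columns in the dat layout (two Int64 dimensions, then the Float64 data in column-major order). It assumes Julia 1.x, and the names write_matrix, read_columns and write_columns! are hypothetical helpers introduced only for this example, not part of the timed test code. Because Julia stores matrices column-major, a range of whole columns is contiguous on disk and can be reached with a single seek.

```julia
# Assumed layout: two Int64 dimensions (16 bytes), then Float64 data, column-major.
const HEADER_BYTES = 2 * sizeof(Int64)

# Byte offset of the first element of column `c` (1-based) for a matrix with `nrows` rows.
column_offset(nrows::Int, c::Int) = HEADER_BYTES + (c - 1) * nrows * sizeof(Float64)

# Write the whole matrix, header first (hypothetical helper).
function write_matrix(path::AbstractString, x::Matrix{Float64})
    open(path, "w") do io
        write(io, Int64(size(x, 1)))
        write(io, Int64(size(x, 2)))
        write(io, x)
    end
    return nothing
end

# Read a contiguous range of columns without loading the rest of the file.
function read_columns(path::AbstractString, cols::UnitRange{Int})
    open(path, "r") do io
        nrows = Int(read(io, Int64))
        ncols = Int(read(io, Int64))
        (1 <= first(cols) && last(cols) <= ncols) || throw(ArgumentError("columns out of range"))
        seek(io, column_offset(nrows, first(cols)))
        return read!(io, Matrix{Float64}(undef, nrows, length(cols)))
    end
end

# Overwrite a contiguous range of columns in place; the matrix size is unchanged.
function write_columns!(path::AbstractString, cols::UnitRange{Int}, vals::Matrix{Float64})
    open(path, "r+") do io   # "r+" opens for reading and writing without truncating
        nrows = Int(read(io, Int64))
        ncols = Int(read(io, Int64))
        (1 <= first(cols) && last(cols) <= ncols) || throw(ArgumentError("columns out of range"))
        size(vals) == (nrows, length(cols)) || throw(DimensionMismatch("vals does not match the slice"))
        seek(io, column_offset(nrows, first(cols)))
        write(io, vals)
    end
    return nothing
end
```

A row slice, by contrast, touches one element per column, so it would need one seek per column (or reading the surrounding columns), which is the non-contiguous case where chunked HDF5 might do better.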

In summary, I'm going to go with the binary method, as I think that at this stage it is clearly the overall winner.

I suspect that JLD2 will eventually be the method of choice, but there is a fair way to go here (the package itself is very new, so there has not been much time for the community to work on optimisations).

Test code follows:

using JLD2, HDF5
using Serialization  # serialize/deserialize live in this standard library on Julia 0.7 and later
# Writers: each stores a single Matrix{Float64} in its own file
f_write_h5(fp::String, x::Matrix{Float64}) = h5write(fp, "G/D", x)
f_write_jld(fp::String, x::Matrix{Float64}) = @save fp x
f_write_slz(fp::String, x::Matrix{Float64}) = open(fid->serialize(fid, x), fp, "w")
# dat layout: two Ints holding the dimensions, followed by the raw Float64 data
f_write_dat_inner(fid1::IOStream, x::Matrix{Float64}) = begin ; write(fid1, size(x,1)) ; write(fid1, size(x,2)) ; write(fid1, x) ; end
f_write_dat(fp::String, x::Matrix{Float64}) = open(fid1->f_write_dat_inner(fid1, x), fp, "w")
# Readers
f_read_h5(fp::String) = h5read(fp, "G/D")
f_read_jld(fp::String) = @load fp x
f_read_slz(fp::String) = open(deserialize, fp, "r")
# read!(io, array) replaces read(io, Float64, dims), which was removed in Julia 1.0
f_read_dat_inner(fid1::IOStream) = begin ; d1 = read(fid1, Int) ; d2 = read(fid1, Int) ; read!(fid1, Matrix{Float64}(undef, d1, d2)) ; end
f_read_dat(fp::String) = open(f_read_dat_inner, fp, "r")
function f_write_test(N::Int, filetype::String)
    dp = "/home/colin/Temp/"
    filetype == "h5" && [ f_write_h5("$(dp)$(n).$(filetype)", randn(1000, 100)) for n = 1:N ]
    filetype == "jld" && [ f_write_jld("$(dp)$(n).$(filetype)", randn(1000, 100)) for n = 1:N ]
    filetype == "slz" && [ f_write_slz("$(dp)$(n).$(filetype)", randn(1000, 100)) for n = 1:N ]
    filetype == "dat" && [ f_write_dat("$(dp)$(n).$(filetype)", randn(1000, 100)) for n = 1:N ]
    #[ rm("$(dp)$(n).$(filetype)") for n = 1:N ]
    nothing
end
function f_read_test(N::Int, filetype::String)
    dp = "/home/colin/Temp/"
    filetype == "h5" && [ f_read_h5("$(dp)$(n).$(filetype)") for n = 1:N ]
    filetype == "jld" && [ f_read_jld("$(dp)$(n).$(filetype)") for n = 1:N ]
    filetype == "slz" && [ f_read_slz("$(dp)$(n).$(filetype)") for n = 1:N ]
    filetype == "dat" && [ f_read_dat("$(dp)$(n).$(filetype)") for n = 1:N ]
    [ rm("$(dp)$(n).$(filetype)") for n = 1:N ]
    nothing
end
f_write_test(1, "h5")
f_write_test(1, "jld")
f_write_test(1, "slz")
f_write_test(1, "dat")
f_read_test(1, "h5")
f_read_test(1, "jld")
f_read_test(1, "slz")
f_read_test(1, "dat")

N = 100
@time f_write_test(N, "h5")
@time f_write_test(N, "jld")
@time f_write_test(N, "slz")
@time f_write_test(N, "dat")
@time f_read_test(N, "h5")
@time f_read_test(N, "jld")
@time f_read_test(N, "slz")
@time f_read_test(N, "dat")

Upvotes: 2

Tasos Papastylianou

Reputation: 22215

You could use the serialize function, provided you heed the warnings in the documentation about non-guarantees between versions etc.

serialize(stream::IO, value)

Write an arbitrary value to a stream in an opaque format, such that it can be read back by deserialize. The read-back value will be as identical as possible to the original. In general, this process will not work if the reading and writing are done by different versions of Julia, or an instance of Julia with a different system image. Ptr values are serialized as all-zero bit patterns (NULL).

An 8-byte identifying header is written to the stream first. To avoid writing the header, construct a SerializationState and use it as the first argument to serialize instead. See also Serializer.writeheader.
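For the single-matrix case in the question, a round trip might look like the sketch below. It assumes Julia 0.7 or later, where serialize and deserialize live in the Serialization standard library; the temporary path is just a placeholder for illustration.

```julia
using Serialization  # standard library on Julia 0.7 and later

x = randn(100, 2)
path = tempname()    # placeholder location for this example

# Write the matrix in Julia's opaque serialization format.
open(io -> serialize(io, x), path, "w")

# Read it back; the result has the same type and contents as x.
y = open(deserialize, path, "r")
```

The file is only reliably readable by the same Julia version and system image that wrote it, so this suits caches and scratch files better than long-term storage.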

Really though, JLD (or in fact, its successor, JLD2) is generally the recommended way*.


*Of particular interest to you might be the statements that: "JLD2 saves and loads Julia data structures in a format comprising a subset of HDF5, without any dependency on the HDF5 C library" and that "it typically outperforms the previous JLD package (sometimes by multiple orders of magnitude) and often outperforms Julia's built-in serializer".

Upvotes: 3

Gnimuc

Reputation: 8566

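Julia has two built-in functions, readdlm and writedlm, for doing this (on Julia 0.7 and later they live in the DelimitedFiles standard library, so run using DelimitedFiles first):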

julia> x = randn(5, 5)
5×5 Array{Float64,2}:
 -1.2837    -0.641382  0.611415   0.965762   -0.962764 
  0.106015  -0.344429  1.40278    0.862094    0.324521 
 -0.603751   0.515505  0.381738  -0.167933   -0.171438 
 -1.79919   -0.224585  1.05507   -0.753046    0.0545622
 -0.110378  -1.16155   0.774612  -0.0796534  -0.503871 

julia> writedlm("txtmat.txt", x, use_mmap=true)

julia> readdlm("txtmat.txt", use_mmap=true)
5×5 Array{Float64,2}:
 -1.2837    -0.641382  0.611415   0.965762   -0.962764 
  0.106015  -0.344429  1.40278    0.862094    0.324521 
 -0.603751   0.515505  0.381738  -0.167933   -0.171438 
 -1.79919   -0.224585  1.05507   -0.753046    0.0545622
 -0.110378  -1.16155   0.774612  -0.0796534  -0.503871 

This is definitely not the fastest way (use Mmap.mmap directly, as DanGetz suggested in the comments, if performance is a big deal), but it seems to be the simplest, and the output file is human-readable.
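For reference, a minimal sketch of the Mmap.mmap route is shown below. It assumes Julia 1.x and a file layout I made up for the example (two Int64 dimensions followed by the raw Float64 data); the path is a placeholder. Once mapped, the matrix can be indexed and modified like an ordinary array, and only the pages that are touched get read or written.

```julia
using Mmap

path = tempname()   # placeholder location for this example
x = randn(100, 2)

# Write a small header (the two dimensions) followed by the raw data.
open(path, "w") do io
    write(io, Int64(size(x, 1)))
    write(io, Int64(size(x, 2)))
    write(io, x)
end

# Reopen for reading and writing, consume the header, and map the rest.
# The mapping starts at the current stream position by default.
io = open(path, "r+")
nrows = Int(read(io, Int64))
ncols = Int(read(io, Int64))
A = Mmap.mmap(io, Matrix{Float64}, (nrows, ncols))

A[1, 2] = 42.0      # modifies the file contents through the mapping
Mmap.sync!(A)       # flush the change to disk
close(io)
```

Reading a slice is then just A[:, 1] or A[5:10, :], with the operating system deciding which parts of the file to load.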

Upvotes: 1
