Reputation: 18530
Let x = randn(100, 2)
. I want to write x
to its own file. This file will contain x
, and only x
, and x
will only ever be of type Matrix{Float64}
. In the past, I have always used HDF5
for this, but it occurs to me that this is over-kill, since in this setup I will only have one array per file. Note that JLD
uses HDF5
, and so is also over-kill.
1) What is the fastest method for reading and writing x
assuming I will only ever want to read the entire matrix?
2) What is the fastest method for reading and writing x
assuming I might want to read a slice of the matrix?
3) What is the fastest method for reading and writing x
assuming I might want to read a slice of the matrix, or over-write a slice of the matrix (but not change the matrix size)?
Upvotes: 1
Views: 1485
Reputation: 18530
Based on the suggestions made by Tasos above, I put together a rudimentary speed test for both writes and reads using 4 different methods:
serialize
and deserialize
)I've pasted the test code at the bottom of this answer. The results are:
julia> @time f_write_test(N, "h5")
0.191555 seconds (2.11 k allocations: 76.380 MiB, 26.39% gc time)
julia> @time f_write_test(N, "jld")
0.774857 seconds (8.33 k allocations: 77.058 MiB, 0.32% gc time)
julia> @time f_write_test(N, "slz")
0.108687 seconds (2.61 k allocations: 76.495 MiB, 1.91% gc time)
julia> @time f_write_test(N, "dat")
0.087488 seconds (1.61 k allocations: 76.379 MiB, 1.08% gc time)
julia> @time f_read_test(N, "h5")
0.051646 seconds (5.81 k allocations: 76.515 MiB, 14.80% gc time)
julia> @time f_read_test(N, "jld")
0.071249 seconds (10.04 k allocations: 77.136 MiB, 57.60% gc time)
julia> @time f_read_test(N, "slz")
0.038967 seconds (3.11 k allocations: 76.527 MiB, 22.17% gc time)
julia> @time f_read_test(N, "dat")
0.068544 seconds (1.81 k allocations: 76.405 MiB, 59.21% gc time)
So for writes, the write to binary option outperforms even serialize
, and is twice as fast as HDF5
and almost an order of magnitude faster than JLD2
.
For reads, deserialize
has the best performance, while HDF5
, JLD2
and reading from binary are all fairly close in performance, with HDF5
being slightly ahead.
I haven't included a test for writing to slices, but may come back to this in the future. Obviously writing to slices is impossible using serialize
(not to mention the versioning/system image issues that serialize
also faces), and I'm not really sure how to do it using JLD2
. My gut feel writing a slice to binary will easily beat HDF5
if the slice is contiguous on disk, but will probably be significantly slower than HDF5
if it is non-contiguous and if the HDF5
method optimally exploits chunking. If HDF5
doesn't exploit chunking (which implies knowing at write time what slices you will want), then I suspect the binary method will come out ahead.
In summary, I'm going to go with the binary method, as I think that at this stage it is clearly the overall winner.
I suspect that eventually, JLD2
will probably be the method of choice, but there is a fair way to go here (the package itself is very new so not much time for the community to work on optimisations etc).
Test code follows:
using JLD2, HDF5
f_write_h5(fp::String, x::Matrix{Float64}) = h5write(fp, "G/D", x)
f_write_jld(fp::String, x::Matrix{Float64}) = @save fp x
f_write_slz(fp::String, x::Matrix{Float64}) = open(fid->serialize(fid, x), fp, "w")
f_write_dat_inner(fid1::IOStream, x::Matrix{Float64}) = begin ; write(fid1, size(x,1)) ; write(fid1, size(x,2)) ; write(fid1, x) ; end
f_write_dat(fp::String, x::Matrix{Float64}) = open(fid1->f_write_dat_inner(fid1, x), fp, "w")
f_read_h5(fp::String) = h5read(fp, "G/D")
f_read_jld(fp::String) = @load fp x
f_read_slz(fp::String) = open(deserialize, fp, "r")
f_read_dat_inner(fid1::IOStream) = begin ; d1 = read(fid1, Int) ; d2 = read(fid1, Int) ; read(fid1, Float64, (d1, d2)) ; end
f_read_dat(fp::String) = open(f_read_dat_inner, fp, "r")
function f_write_test(N::Int, filetype::String)
dp = "/home/colin/Temp/"
filetype == "h5" && [ f_write_h5("$(dp)$(n).$(filetype)", randn(1000, 100)) for n = 1:N ]
filetype == "jld" && [ f_write_jld("$(dp)$(n).$(filetype)", randn(1000, 100)) for n = 1:N ]
filetype == "slz" && [ f_write_slz("$(dp)$(n).$(filetype)", randn(1000, 100)) for n = 1:N ]
filetype == "dat" && [ f_write_dat("$(dp)$(n).$(filetype)", randn(1000, 100)) for n = 1:N ]
#[ rm("$(dp)$(n).$(filetype)") for n = 1:N ]
nothing
end
function f_read_test(N::Int, filetype::String)
dp = "/home/colin/Temp/"
filetype == "h5" && [ f_read_h5("$(dp)$(n).$(filetype)") for n = 1:N ]
filetype == "jld" && [ f_read_jld("$(dp)$(n).$(filetype)") for n = 1:N ]
filetype == "slz" && [ f_read_slz("$(dp)$(n).$(filetype)") for n = 1:N ]
filetype == "dat" && [ f_read_dat("$(dp)$(n).$(filetype)") for n = 1:N ]
[ rm("$(dp)$(n).$(filetype)") for n = 1:N ]
nothing
end
f_write_test(1, "h5")
f_write_test(1, "jld")
f_write_test(1, "slz")
f_write_test(1, "dat")
f_read_test(1, "h5")
f_read_test(1, "jld")
f_read_test(1, "slz")
f_read_test(1, "dat")
N = 100
@time f_write_test(N, "h5")
@time f_write_test(N, "jld")
@time f_write_test(N, "slz")
@time f_write_test(N, "dat")
@time f_read_test(N, "h5")
@time f_read_test(N, "jld")
@time f_read_test(N, "slz")
@time f_read_test(N, "dat")
Upvotes: 2
Reputation: 22215
You could use the serialize
function, provided you heed the warnings in the documentation about non-guarantees between versions etc.
serialize(stream::IO, value)
Write an arbitrary value to a stream in an opaque format, such that it can be read back by deserialize. The read-back value will be as identical as possible to the original. In general, this process will not work if the reading and writing are done by different versions of Julia, or an instance of Julia with a different system image. Ptr values are serialized as all-zero bit patterns (NULL).
An 8-byte identifying header is written to the stream first. To avoid writing the header, construct a SerializationState and use it as the first argument to serialize instead. See also Serializer.writeheader.
Really though, JLD (or in fact, its successor, JLD2) is generally the recommended way*.
*Of particular interest to you might be the statements that: "JLD2 saves and loads Julia data structures in a format comprising a subset of HDF5, without any dependency on the HDF5 C library" and that "it typically outperforms the previous JLD package (sometimes by multiple orders of magnitude) and often outperforms Julia's built-in serializer".
Upvotes: 3
Reputation: 8566
Julia has two build-in functions readdlm
& writedlm
for doing this:
julia> x = randn(5, 5)
5×5 Array{Float64,2}:
-1.2837 -0.641382 0.611415 0.965762 -0.962764
0.106015 -0.344429 1.40278 0.862094 0.324521
-0.603751 0.515505 0.381738 -0.167933 -0.171438
-1.79919 -0.224585 1.05507 -0.753046 0.0545622
-0.110378 -1.16155 0.774612 -0.0796534 -0.503871
julia> writedlm("txtmat.txt", x, use_mmap=true)
julia> readdlm("txtmat.txt", use_mmap=true)
5×5 Array{Float64,2}:
-1.2837 -0.641382 0.611415 0.965762 -0.962764
0.106015 -0.344429 1.40278 0.862094 0.324521
-0.603751 0.515505 0.381738 -0.167933 -0.171438
-1.79919 -0.224585 1.05507 -0.753046 0.0545622
-0.110378 -1.16155 0.774612 -0.0796534 -0.503871
Definitely not the fastest way(use Mmap.mmap
directly as DanGetz suggested in the comment if performance is a big deal), but it seems this is the simplest way and the output file is human-readable.
Upvotes: 1