Reputation: 2005
I tried to compress data using Arrow.jl. However, a test run with the code below didn't show any size reduction (i.e. no compression). May I ask for advice on my implementation? Is there something I am doing wrong?
Code:
using CSV, DataFrames, Arrow

df = CSV.read("input_data.csv", DataFrame)

function compress_data(data::DataFrame)
    io = Arrow.tobuffer(data)
    d = Arrow.Table(io; convert=false)
    Arrow.write("output_data.lz4", d; compress=:lz4)
end

compress_data(df)
Look forward to the suggestions. Thanks!
Upvotes: 2
Views: 487
Reputation: 2375
The code looks fine: when I tested it with an input CSV containing only zero values, the compression ratio was high.
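A check along those lines can be sketched as follows. This is only a minimal sketch, not the exact test that was run: `arrow_nbytes` is a hypothetical helper name, the tables are generated in memory instead of being read from a CSV, and the row count and seed are arbitrary assumptions. It writes the same column with and without `compress=:lz4` and compares the resulting byte counts.

```julia
using Arrow, Random

# Hypothetical helper (not part of Arrow.jl): serialize a table into an
# in-memory buffer and return the number of bytes written.
function arrow_nbytes(tbl; kwargs...)
    io = IOBuffer()
    Arrow.write(io, tbl; kwargs...)
    return length(take!(io))
end

Random.seed!(42)   # arbitrary seed, only for repeatability
n = 100_000        # arbitrary number of rows

zeros_tbl  = (x = zeros(n),)   # highly repetitive data
random_tbl = (x = rand(n),)    # uniform Float64 values in [0, 1)

zeros_raw = arrow_nbytes(zeros_tbl)
zeros_lz4 = arrow_nbytes(zeros_tbl; compress = :lz4)
rand_raw  = arrow_nbytes(random_tbl)
rand_lz4  = arrow_nbytes(random_tbl; compress = :lz4)

println("zeros:  ", zeros_raw, " bytes raw, ", zeros_lz4, " bytes with lz4")
println("random: ", rand_raw, " bytes raw, ", rand_lz4, " bytes with lz4")
```

If the zeros column shrinks a lot while the random column barely changes, the compression itself is working and the data is the limiting factor.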
I suspect the data here consists of floating point numbers, in which case there are 2 potentially tricky things to keep in mind:
- For values in a range such as 0. < x < 1., we might expect potential for compression, but we will likely be disappointed, as the byte pattern of floats doesn't lend itself to common compression techniques.
- Float64 takes 8 bytes per value in binary form, whereas a CSV might truncate decimals and store much less than 8 bytes per value, so it's possible to actually increase the size when saving the binary representation instead.
Compression techniques for floats do exist, however, e.g. Blosc, but results are likely to be disappointing unless you are lucky with your data. Lossy compression techniques, e.g. zfp, can achieve high compression rates. You can find more information on the topic here on SO: Compressing floating point data
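The size comparison between CSV text and binary Float64 storage can be illustrated with the standard library alone. This is a rough sketch under stated assumptions: the values are uniform random numbers, the CSV is assumed to keep 3 decimals with one value per line, and the sample size and seed are arbitrary.

```julia
using Random

Random.seed!(1)      # arbitrary seed, only for repeatability
x = rand(10_000)     # uniform Float64 values in [0, 1)

# Binary representation: every Float64 occupies exactly 8 bytes.
binary_bytes = sizeof(x)

# Text representation, assuming values rounded to 3 decimals and one
# newline character per value (roughly what a truncated CSV column holds).
text_bytes = sum(v -> ncodeunits(string(round(v; digits = 3))) + 1, x)

# Number of distinct values: with full-precision random floats almost every
# value is unique, so a match-based compressor finds few repeated patterns.
distinct_values = length(unique(x))

println("binary: ", binary_bytes, " bytes")
println("text:   ", text_bytes, " bytes")
println("distinct values: ", distinct_values, " of ", length(x))
```

Under these assumptions the text form is smaller than the binary form before any compression is applied, so an Arrow file can end up larger than the CSV it was created from.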
Upvotes: 2