Mohammad Saad

Reputation: 2005

Data compression using Arrow.jl in Julia

I tried to compress data using Arrow.jl. However, a test run with the code below showed no size reduction at all. Is there something wrong with my implementation? Code:

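using CSV, DataFrames, Arrow

# Load the CSV into a DataFrame.
df = CSV.read("input_data.csv", DataFrame)

function compress_data(data::DataFrame)
    # Serialize to an in-memory Arrow buffer, then read it back as an Arrow.Table.
    io = Arrow.tobuffer(data)
    d = Arrow.Table(io; convert=false)
    # Write an Arrow file whose column buffers are LZ4-compressed.
    Arrow.write("output_data.lz4", d; compress=:lz4)
end

compress_data(df)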

Looking forward to your suggestions. Thanks!

Upvotes: 2

Views: 487

Answers (1)

Mikael Öhman

Reputation: 2375

The code looks fine: when I test it with an input CSV containing only zeros, the compression ratio is high.
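A rough way to reproduce that check without a CSV file is to serialize two in-memory tables and compare buffer sizes. This is only a sketch: the helper `arrow_size`, the column name `x`, the row count, and the random seed are all made up for illustration, and the exact byte counts will depend on your Arrow.jl version.

```julia
using Arrow, Random

# Hypothetical helper: bytes taken by a table serialized to an in-memory Arrow buffer.
arrow_size(tbl; kwargs...) = bytesavailable(Arrow.tobuffer(tbl; kwargs...))

n = 100_000
tables = Dict(
    "zeros"  => (x = zeros(Float64, n),),            # highly repetitive data
    "random" => (x = rand(MersenneTwister(1), n),),  # uniform floats in [0, 1)
)

# Map each table name to (uncompressed size, LZ4-compressed size).
sizes = Dict{String,Tuple{Int,Int}}()
for (name, tbl) in tables
    sizes[name] = (arrow_size(tbl), arrow_size(tbl; compress = :lz4))
    println(name, ": uncompressed = ", sizes[name][1], " bytes, lz4 = ", sizes[name][2], " bytes")
end
```

If the all-zero column shrinks a lot while the random column barely changes, the compression setting is working and the data itself is the limiting factor.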

I suspect your data consists of floating-point numbers, in which case there are two potentially tricky things to keep in mind:

  1. When the floats lie within a small range, e.g. 0. < x < 1., we might expect some potential for compression, but we will likely be disappointed, because the byte pattern of floats doesn't lend itself to common compression techniques.
  2. The text representation of a Float64 may truncate decimals and take far fewer than 8 bytes per value, so the file can actually get larger when you save the binary representation instead.
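Both points can be seen with a few values in plain Julia. This is a small sketch: the helper `text_vs_binary` and the sample values are invented for illustration and are not taken from the question's data.

```julia
# Hypothetical helper: (bytes as printed text, bytes as raw binary) for one value.
text_vs_binary(v::Float64) = (sizeof(string(v)), sizeof(v))

samples = [0.5, 0.25, 0.1234, 0.1235, 0.123456789012345]

for v in samples
    text_bytes, binary_bytes = text_vs_binary(v)
    # Hexadecimal view of the 64-bit pattern that the binary format stores.
    bits = string(reinterpret(UInt64, v), base = 16, pad = 16)
    println(rpad(string(v), 20), "text = ", text_bytes, " B, binary = ", binary_bytes,
            " B, bits = 0x", bits)
end
```

Short decimal values take fewer bytes as text than the fixed 8 bytes of a Float64, and values that look close in decimal (such as 0.1234 and 0.1235) can differ in most of their mantissa bytes, leaving a byte-oriented compressor like LZ4 few repeated sequences to match.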

Compression techniques for floats do exist (e.g. Blosc), but the results are likely to be disappointing unless you are lucky with your data. Lossy compression techniques such as zfp can achieve high compression ratios. You can find more information on the topic here on SO: Compressing floating point data

Upvotes: 2
