psych0groov3

Reputation: 701

Convert Julia DataFrame to an array of bytes for compression

I loaded two datasets from CSV files and then merged them using a leftjoin:

using CSV
using DataFrames
using CodecZstd

df1 = CSV.read(joinpath(root, "data", "raw", "df1.csv"), DataFrame)
df2 = CSV.read(joinpath(root, "data", "raw", "df2.csv"), DataFrame)

merged = leftjoin(df1, df2, on=:id)

Now I want to write the merged dataframe to disk as a .zst compressed file (Zstandard compression).

I managed to do this by first writing to .csv, reading that file back, and then writing it again as .zst. Is there a way to convert a DataFrame directly into an array of bytes that I can save to disk?

Upvotes: 2

Views: 104

Answers (2)

Przemyslaw Szufel

Reputation: 42214

To do precisely what you ask, you can write:

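using CSV, DataFrames, CodecZstd
df = DataFrame(a='a':'h', b=1:8)
# Wrap the file handle so that everything written to it is Zstandard-compressed
fout = ZstdCompressorStream(open("z.zst", "w"))
# CSV.write takes the destination first, then the table
CSV.write(fout, df)
# Closing the stream flushes the remaining compressed data and closes the file
close(fout)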

Now this can be read as:

julia> CSV.read(ZstdDecompressorStream(open("z.zst")), DataFrame)
8×2 DataFrame
 Row │ a        b
     │ String1  Int64
─────┼────────────────
   1 │ a            1
   2 │ b            2
   3 │ c            3
   4 │ d            4
   5 │ e            5
   6 │ f            6
   7 │ g            7
   8 │ h            8

Another reasonable option is to use Apache Arrow instead of CSV to write the DataFrame. The compression composes in the same way as above.
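As a rough sketch of the Arrow route, assuming the Arrow.jl package is installed: the table is written to an in-memory buffer, the resulting bytes are compressed with `transcode`, and that byte array is saved to disk. The file name `z.arrow.zst` and the small example table are made up for illustration.

```julia
using Arrow, DataFrames, CodecZstd

# Small illustrative table (not the data from the question)
df = DataFrame(a=["a", "b", "c"], b=[1, 2, 3])

# Write the table in Arrow format into an in-memory buffer
buf = IOBuffer()
Arrow.write(buf, df)

# Compress the raw Arrow bytes; `bytes` is a Vector{UInt8}
bytes = transcode(ZstdCompressor, take!(buf))
write("z.arrow.zst", bytes)

# Read back: decompress the bytes and rebuild the DataFrame
restored = DataFrame(Arrow.Table(transcode(ZstdDecompressor, read("z.arrow.zst"))))
```

The same `IOBuffer` plus `transcode` pattern should also work with `CSV.write(buf, df)` if an in-memory byte array of compressed CSV is what you need.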

Upvotes: 5

Bogumił Kamiński

Reputation: 69869

There are several options. The one built into Julia is to serialize the data frame. You can do this with the Serialization standard library. It offers two functions: serialize, which writes an object to a stream, and deserialize, which reads it back. You can then use CodecZstd.jl to compress the serialized stream and save it to disk.

Note that when you use serialization it is your responsibility to ensure that the Julia and package versions are consistent between the Julia session where you write data and where you read your data.
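A minimal sketch of this approach, reusing the small example table from the other answer; the file name `df.jls.zst` is arbitrary and only for illustration:

```julia
using Serialization, DataFrames, CodecZstd

df = DataFrame(a='a':'h', b=1:8)

# Serialize straight into a Zstandard-compressing stream
stream = ZstdCompressorStream(open("df.jls.zst", "w"))
serialize(stream, df)
close(stream)  # flushes the compressed data and closes the underlying file

# Read back: decompress on the fly and deserialize
stream = ZstdDecompressorStream(open("df.jls.zst"))
restored = deserialize(stream)
close(stream)
```

The file written this way is only expected to be readable from a session with matching Julia and package versions, as noted above.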

Upvotes: 3
