Reputation: 701
I loaded two datasets from CSV files and then merged them using leftjoin:
using CSV
using DataFrames
using CodecZstd
df1 = CSV.read(joinpath(root, "data", "raw", "df1.csv"), DataFrame)
df2 = CSV.read(joinpath(root, "data", "raw", "df2.csv"), DataFrame)
merged = leftjoin(df1, df2, on=:id)
Now I want to write the merged DataFrame to disk as a .zst file (Zstandard compression). I managed to do this by first writing to .csv, then reading that file back and writing it again as .zst, but is there a way to directly convert a DataFrame into an array of bytes that I can save to disk?
Upvotes: 2
Views: 104
Reputation: 42214
To do precisely what you ask, you can write:
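using CSV, DataFrames, CodecZstd
# Wrap the file handle so everything written to it is Zstandard-compressed
fout = ZstdCompressorStream(open("z.zst", "w"))
df = DataFrame(a='a':'h', b=1:8)
# CSV.write takes the destination first, then the table
CSV.write(fout, df)
# Closing the stream flushes the compressor and closes the underlying file
close(fout)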
Now this can be read as:
julia> CSV.read(ZstdDecompressorStream(open("z.zst")), DataFrame)
8×2 DataFrame
 Row │ a        b
     │ String1  Int64
─────┼────────────────
   1 │ a            1
   2 │ b            2
   3 │ c            3
   4 │ d            4
   5 │ e            5
   6 │ f            6
   7 │ g            7
   8 │ h            8
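If you need the compressed data as an array of bytes rather than a file, one possible sketch is to write the CSV into an in-memory IOBuffer and compress the result with transcode. The small data frame, the variable names, and the file name bytes.zst are made up for illustration:

```julia
using CSV, DataFrames, CodecZstd

# Illustrative data, not the data from the question
df = DataFrame(a=["x", "y", "z"], b=1:3)

# Write the CSV text into memory instead of a file
buf = IOBuffer()
CSV.write(buf, df)

# Compress the raw CSV bytes with Zstandard; the result is a Vector{UInt8}
compressed = transcode(ZstdCompressor, take!(buf))

# The bytes can be saved as they are (hypothetical file name)
write("bytes.zst", compressed)

# Later: read the bytes, decompress them, and parse the CSV again
restored = CSV.read(IOBuffer(transcode(ZstdDecompressor, read("bytes.zst"))), DataFrame)
```

This holds the whole table in memory twice (plain and compressed), so for large data the streaming version above is probably the better choice.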
Another reasonable option is to use Apache Arrow instead of CSV to write the DataFrame. The compression composes in the same way as above.
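A minimal sketch of the Arrow variant, assuming the Arrow.jl package is installed. To stay on the safe side it serializes to an in-memory buffer first and then compresses the bytes; the data and the file name z.arrow.zst are illustrative:

```julia
using Arrow, DataFrames, CodecZstd

# Illustrative data, not the data from the question
df = DataFrame(a=["x", "y", "z"], b=1:3)

# Serialize the table in Arrow format into memory
buf = IOBuffer()
Arrow.write(buf, df)

# Compress the Arrow bytes with Zstandard and save them (hypothetical file name)
write("z.arrow.zst", transcode(ZstdCompressor, take!(buf)))

# Later: decompress the bytes and rebuild a DataFrame from the Arrow table
tbl = Arrow.Table(transcode(ZstdDecompressor, read("z.arrow.zst")))
restored = DataFrame(tbl)
```

Arrow.jl also documents a compress keyword for Arrow.write (for example compress=:zstd), which compresses inside the Arrow format itself, so it may be worth checking whether that fits your use case better than an outer .zst wrapper.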
Upvotes: 5
Reputation: 69869
There are several options. The one built into Julia is to serialize the data frame. You can do this with the Serialization standard library. It offers two functions: serialize, which writes an object to a stream, and deserialize, which reads it back from a stream. You can then use CodecZstd.jl to compress the serialized stream and save it to disk.
Note that when you use serialization it is your responsibility to ensure that the Julia and package versions are consistent between the Julia session where you write data and where you read your data.
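As a rough sketch of how these pieces could fit together (the file name df.jls.zst and the example data are made up):

```julia
using Serialization, DataFrames, CodecZstd

# Illustrative data, not the data from the question
df = DataFrame(a=["x", "y", "z"], b=1:3)

# Serialize straight into a Zstandard-compressing stream
out = ZstdCompressorStream(open("df.jls.zst", "w"))
serialize(out, df)
close(out)  # flushes the compressor and closes the file

# Read it back; do this with the same Julia and package versions
inp = ZstdDecompressorStream(open("df.jls.zst"))
restored = deserialize(inp)
close(inp)
```

Unlike the CSV route, this keeps the column element types exactly as they were, at the cost of the version coupling described above.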
Upvotes: 3