Reputation: 8967
I have multiple small parquet files generated as the output of a Hive QL job. I would like to merge those output files into a single parquet file.
What is the best way to do this using HDFS or Linux commands?
We used to merge text files using the cat command, but will this work for parquet as well?
Can we do it using HiveQL itself when writing the output files, like we do with the repartition or coalesce methods in Spark?
Upvotes: 22
Views: 65827
Reputation: 101
Give joinem a try, available via PyPI: python3 -m pip install joinem.
joinem provides a CLI for fast, flexible concatenation of tabular data using polars. I/O is lazily streamed in order to give good performance when working with numerous, large files.
Pass input files via stdin and output file as an argument.
ls -1 path/to/*.parquet | python3 -m joinem out.parquet
You can add the --progress flag to get a progress bar.
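For example (the exact flag placement is my guess; check python3 -m joinem --help):
ls -1 path/to/*.parquet | python3 -m joinem --progress out.parquet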
If you are working in an HPC environment, joinem can also be conveniently used via singularity/apptainer.
ls -1 *.pqt | singularity run docker://ghcr.io/mmore500/joinem out.pqt
joinem is also compatible with CSV, JSON, and feather file types. See the project's README for more usage examples and a full command-line interface API listing.
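As a minimal sketch of the CSV case (assuming, since it is not shown here, that joinem picks the file type from the extensions):
ls -1 path/to/*.csv | python3 -m joinem out.csv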
Disclosure: I am the author of the joinem library.
Upvotes: 0
Reputation: 351
Using DuckDB:
import duckdb

# Read every file matching the glob and write all rows out as a single parquet file.
duckdb.execute("""
COPY (SELECT * FROM '*.parquet') TO 'merge.parquet' (FORMAT 'parquet');
""")
Upvotes: 14
Reputation: 2251
According to https://issues.apache.org/jira/browse/PARQUET-460, you can now download the source code and compile parquet-tools, which has a built-in merge command:
java -jar ./target/parquet-tools-1.8.2-SNAPSHOT.jar merge /input_directory/ /output_dir/file_name
Or use a tool like https://github.com/stripe/herringbone
Upvotes: 19
Reputation: 14494
You can also do it using HiveQL itself, if your execution engine is mapreduce.
You can set a flag for your query which causes Hive to merge small files at the end of your job:
SET hive.merge.mapredfiles=true;
or
SET hive.merge.mapfiles=true;
if your job is a map-only job.
This will cause the Hive job to automatically merge many small parquet files into fewer big files. You can control the number of output files by adjusting the hive.merge.size.per.task setting. If you want to have just one file, make sure you set it to a value which is always larger than the size of your output. Also, make sure to adjust hive.merge.smallfiles.avgsize accordingly: the merge step is triggered when the average output file size falls below this threshold, so set it to a very high value if you want to make sure that Hive always merges files. You can read more about these settings in the Hive documentation.
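Putting it together, a sketch of what a session might look like (the table names and the 1 GB figures are illustrative, not from this answer):

SET hive.merge.mapfiles=true;                  -- merge outputs of map-only jobs
SET hive.merge.mapredfiles=true;               -- merge outputs of map-reduce jobs
SET hive.merge.size.per.task=1073741824;       -- target size of merged files: 1 GB
SET hive.merge.smallfiles.avgsize=1073741824;  -- merge whenever the avg output file is below 1 GB

INSERT OVERWRITE TABLE merged_table            -- hypothetical target/source tables
SELECT * FROM small_files_table;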
Upvotes: 6