Reputation: 3565
Given a Parquet dataset distributed on HDFS (a metadata file plus many .parquet
part files), how can I correctly merge the parts and collect the data onto the local file system? dfs -getmerge ...
doesn't work - it merges the metadata with the actual parquet files.
Upvotes: 2
Views: 4425
Reputation: 31
You may use parquet-tools:
java -jar parquet-tools.jar merge source/ target/
Upvotes: 0
Reputation: 31
You may use Pig:
A = LOAD '/path/to/parquet/files' USING parquet.pig.ParquetLoader AS (x, y, z);
STORE A INTO 'xyz path' USING PigStorage('|');
You may also create an Impala table on top of it, and then use
impala-shell -e "query" -o <output>
In the same way, you may use MapReduce as well.
Upvotes: 1
Reputation: 3565
There is a way using the Apache Spark APIs that provides a solution, although a more efficient method without third-party tools may exist.
spark> val parquetData = sqlContext.parquetFile("pathToMultipartParquetHDFS")
spark> parquetData.repartition(1).saveAsParquetFile("pathToSinglePartParquetHDFS")
bash> ../bin/hadoop dfs -get pathToSinglePartParquetHDFS localPath
Since Spark 1.4 it is better to use DataFrame.coalesce(1)
instead of DataFrame.repartition(1), since coalesce avoids a full shuffle when reducing to a single partition.
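For reference, the same approach with the DataFrame reader/writer API available since Spark 1.4 looks like this (a minimal sketch assuming spark-shell with its built-in sqlContext; the path names are placeholders):
spark> val df = sqlContext.read.parquet("pathToMultipartParquetHDFS")   // read all part files into one DataFrame
spark> df.coalesce(1).write.parquet("pathToSinglePartParquetHDFS")      // write out as a single part file
The single part file can then be copied to the local file system with hadoop dfs -get as above.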
Upvotes: 2