Oleg Shirokikh

Reputation: 3565

Collecting Parquet data from HDFS to local file system

Given a Parquet dataset distributed on HDFS (a metadata file plus many .parquet part files), how do I correctly merge the parts and collect the data onto the local file system? dfs -getmerge ... doesn't work - it merges the metadata together with the actual parquet files.

Upvotes: 2

Views: 4425

Answers (3)

ganeshTripathi

Reputation: 31

You may use parquet-tools:

java -jar parquet-tools.jar merge source/ target/

Upvotes: 0

ganeshTripathi

Reputation: 31

You may use Pig:

A = LOAD '/path/to/parquet/files' USING parquet.pig.ParquetLoader AS (x,y,z);
STORE A INTO 'xyz path' USING PigStorage('|');

Alternatively, you may create an Impala table on top of it and then use:

impala-shell -e "query" -o <output>

In the same way, you may use MapReduce as well.

Upvotes: 1

Oleg Shirokikh

Reputation: 3565

There is a way involving the Apache Spark APIs that provides a solution, although a more efficient method without third-party tools may exist.

spark> val parquetData = sqlContext.parquetFile("pathToMultipartParquetHDFS")
spark> parquetData.repartition(1).saveAsParquetFile("pathToSinglePartParquetHDFS")

bash> ../bin/hadoop dfs -get pathToSinglePartParquetHDFS localPath

Since Spark 1.4 it is better to use DataFrame.coalesce(1) instead of DataFrame.repartition(1), since coalesce avoids a full shuffle of the data.
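
For completeness, a minimal sketch of the same flow using the Spark 1.4+ DataFrame reader/writer API (the HDFS paths are the same placeholders as above):

spark> // read the multi-part Parquet dataset as a DataFrame
spark> val df = sqlContext.read.parquet("pathToMultipartParquetHDFS")
spark> // shrink to a single partition and write it out as one Parquet part
spark> df.coalesce(1).write.parquet("pathToSinglePartParquetHDFS")

After that, the same hadoop dfs -get command as above copies the single-part output to the local file system.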

Upvotes: 2
