Thomas Decaux

Reputation: 22691

Are Parquet metadata files useful on HDFS?

We use Spark to generate Parquet files on HDFS.

Spark generates four files: the Parquet file with the data, plus three metadata files. The problem is that each of the three metadata files takes up one HDFS block (128 MB here), and since we run many tasks like this, that can waste a lot of space for nothing.

Are these files needed? Is there a good way to deal with them?

Upvotes: 1

Views: 1339

Answers (1)

eliasah

Reputation: 40380

The metadata files in the Parquet output folder are optional. Spark does not need them to read Parquet files, because each Parquet file already has its metadata embedded in it.

On the other hand, Thrift does need them to read those files.

As of Spark 2.0, writing Parquet summary files is disabled by default. [Ref. SPARK-15719.]
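On older Spark versions, you can turn the summary files off yourself by setting the parquet-mr property `parquet.enable.summary-metadata` to `false` in the Hadoop configuration before writing. A minimal sketch (the output path is hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-no-summary")
  .getOrCreate()

// Disable generation of _metadata / _common_metadata summary files
// (parquet-mr property; Spark 2.0+ already defaults this off).
spark.sparkContext.hadoopConfiguration
  .set("parquet.enable.summary-metadata", "false")

val df = spark.range(1000).toDF("id")

// Only the data files (plus _SUCCESS) are written; no summary metadata.
df.write.parquet("hdfs:///tmp/example_output")  // hypothetical path
```

Note that `_SUCCESS` is a zero-byte marker written by the Hadoop output committer, not a Parquet summary file; it is controlled separately by `mapreduce.fileoutputcommitter.marksuccessfuljobs`.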

Upvotes: 3

Related Questions