Reputation: 22691
We use Spark to generate parquet files on HDFS.
Spark generates 4 files: the Parquet file with the data, plus 3 metadata files. The problem is that the 3 metadata files take up a block (128 MB here), and since we run many jobs like this, this wastes a lot of space for nothing.
Are these files needed? Or is there a better way to deal with them?
Upvotes: 1
Views: 1339
Reputation: 40380
The metadata file in the Parquet output folder is optional. It is not needed by Spark to read the Parquet files, since each Parquet file has the metadata embedded in it.
On the other hand, it is needed by Thrift to read those files.
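As a quick check, here is a minimal sketch (the output path `hdfs:///tmp/parquet_out` is hypothetical, Spark 1.x API): even with the summary files removed from the folder, the read succeeds, because every part file embeds its own footer metadata.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("read-without-summaries"))
val sqlContext = new SQLContext(sc)

// Even if _metadata and _common_metadata have been deleted from the
// output folder, Spark reads the data fine: it falls back to the
// footer embedded in each individual .parquet part file.
val df = sqlContext.read.parquet("hdfs:///tmp/parquet_out")
df.show()
```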
As of Spark 2.0, writing Parquet summary files is disabled by default. [Ref. SPARK-15719.]
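On versions before 2.0 you can turn them off yourself. A minimal sketch, relying on the standard parquet-hadoop property `parquet.enable.summary-metadata` (the app name and output path are hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("no-summary-files"))

// Tell parquet-hadoop's output committer not to write the
// _metadata / _common_metadata summary files.
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

// The zero-byte _SUCCESS marker is controlled separately, by the
// Hadoop file output committer:
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

val sqlContext = new SQLContext(sc)
sqlContext.range(0, 1000).write.parquet("hdfs:///tmp/parquet_out")
```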
Upvotes: 3