Reputation: 22691
We use Spark to generate parquet files on HDFS.
Spark generates 4 files: the Parquet file with the data, plus 3 metadata files. The problem is that the 3 metadata files take up a block (128 MB here), and since we run many jobs like this, this wastes a lot of space for nothing.
Are these files needed? Or is there a better way to deal with them?
Upvotes: 1
Views: 1339
Reputation: 40380
The metadata file in the Parquet output folder is optional. It is not needed by Spark to read the Parquet files, since each Parquet file has the metadata embedded in it.
On the other hand, it is needed by Thrift to read those files.
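As a quick check, here is a minimal sketch (the output path `hdfs:///tmp/parquet_out` is hypothetical, Spark 1.x API): even with the summary files removed from the folder, the read succeeds, because every part file embeds its own footer metadata.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("read-without-summaries"))
val sqlContext = new SQLContext(sc)

// Even if _metadata and _common_metadata have been deleted from the
// output folder, Spark reads the data fine: it falls back to the
// footer embedded in each individual .parquet part file.
val df = sqlContext.read.parquet("hdfs:///tmp/parquet_out")
df.show()
```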
As of Spark 2.0, writing Parquet summary files is disabled by default. [Ref. SPARK-15719.]
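On versions before 2.0 you can turn them off yourself. A minimal sketch, relying on the standard parquet-hadoop property `parquet.enable.summary-metadata` (the app name and output path are hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("no-summary-files"))

// Tell parquet-hadoop's output committer not to write the
// _metadata / _common_metadata summary files.
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

// The zero-byte _SUCCESS marker is controlled separately, by the
// Hadoop file output committer:
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

val sqlContext = new SQLContext(sc)
sqlContext.range(0, 1000).write.parquet("hdfs:///tmp/parquet_out")
```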
Upvotes: 3