Marwan02

Reputation: 67

Extension of compressed parquet file in Spark

In my Spark job, I write a compressed parquet file like this:

df
  .repartition(numberOutputFiles)
  .write
  .option("compression","gzip")
  .mode(saveMode)
  .parquet(avroPath)

Then my files have this extension: file_name.gz.parquet

How can I have ".parquet.gz"?

Upvotes: 2

Views: 556

Answers (1)

mazaneicha

Reputation: 9427

I don't believe you can. The file extension is hardcoded in ParquetWrite.scala as a concatenation of the codec's extension and ".parquet", in that order:

    // ...
    override def getFileExtension(context: TaskAttemptContext): String = {
      CodecConfig.from(context).getCodec.getExtension + ".parquet"
    }
    // ...

So, unless you want to change the source and compile your own Spark version, or open a JIRA request against Spark... ;))
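That said, if a downstream consumer insists on the ".parquet.gz" order, you can rename the part files after the write finishes instead of changing Spark itself. Below is a minimal sketch; the directory and file names are stand-ins for your real output path, and it assumes the output landed on a local filesystem (for HDFS you would use `hdfs dfs -mv` instead of `mv`):

```shell
# Demo on a throwaway local directory standing in for the Spark output path.
out=$(mktemp -d)
touch "$out/part-00000-abc.gz.parquet" "$out/part-00001-abc.gz.parquet"

# Rename every "*.gz.parquet" part file to "*.parquet.gz".
for f in "$out"/*.gz.parquet; do
  mv "$f" "${f%.gz.parquet}.parquet.gz"
done

ls "$out"
```

Note that Spark itself will no longer auto-detect the gzip codec from a ".parquet.gz" name on read, so this is only worth doing if whatever consumes the files requires that ordering.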

Upvotes: 1
