Reputation: 405
I was writing data to Hadoop and Hive in Parquet format using Spark. I want to enable compression, but I can only find two types of compression being used most of the time: Snappy and Gzip. Does Parquet also support other codecs, such as Deflate and LZO?
Upvotes: 18
Views: 30860
Reputation: 8796
The supported compression types for Apache Parquet are specified in the parquet-format
repository:
/**
* Supported compression algorithms.
*
* Codecs added in 2.4 can be read by readers based on 2.4 and later.
* Codec support may vary between readers based on the format version and
* libraries available at runtime. Gzip, Snappy, and LZ4 codecs are
* widely available, while Zstd and Brotli require additional libraries.
*/
enum CompressionCodec {
  UNCOMPRESSED = 0;
  SNAPPY = 1;
  GZIP = 2;
  LZO = 3;
  BROTLI = 4;  // Added in 2.4
  LZ4 = 5;     // Added in 2.4
  ZSTD = 6;    // Added in 2.4
}
Snappy and Gzip are the most commonly used codecs and are supported by all implementations. LZ4 and ZSTD compress better than the former two, but they are a rather recent addition to the format, so they are only supported in newer versions of some implementations.
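For illustration, here is a minimal Scala sketch that writes the same DataFrame with several of these codecs so you can compare output sizes. It assumes a Spark build whose bundled Parquet libraries actually ship the codec you pick (the newer codecs such as zstd only became usable around Spark 2.4); the output paths are placeholders.

import org.apache.spark.sql.SparkSession

// Minimal sketch: write identical data with different codecs and compare
// the resulting directory sizes. Assumes the codec is available at runtime;
// /tmp/parquet-* are hypothetical output paths.
val spark = SparkSession.builder().appName("parquet-codecs").getOrCreate()
val df = spark.range(1000000L).toDF("id")

for (codec <- Seq("snappy", "gzip", "zstd")) {
  df.write
    .option("compression", codec)       // per-write codec selection
    .mode("overwrite")
    .parquet(s"/tmp/parquet-$codec")    // hypothetical output directory
}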
Upvotes: 24
Reputation: 9067
From the Spark source code, branch 2.1:
You can set the following Parquet-specific option(s) for writing Parquet files:

compression (default is the value specified in spark.sql.parquet.compression.codec): compression codec to use when saving to file. This can be one of the known case-insensitive shorten names (none, snappy, gzip, and lzo). This will override spark.sql.parquet.compression.codec.
...
Overall, the supported compression codecs are: none, uncompressed, snappy, gzip, lzo, brotli, lz4, and zstd.
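As a sketch of how the two settings quoted above interact (df and the paths are placeholders): the session-level config sets the default codec, and the per-write compression option overrides it for a single write.

// Session-wide default codec for all Parquet writes
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")

df.write.parquet("/tmp/out-gzip")  // uses the session default (gzip)

// The per-write option overrides the session default
df.write.option("compression", "snappy").parquet("/tmp/out-snappy")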
Upvotes: 11