Artem

Reputation: 1217

How to set Parquet file encoding in Spark

The Parquet documentation describes a few different encodings here

Does the encoding change somehow inside the file during read/write, or can I set it? There is nothing about it in the Spark documentation. I only found slides from a talk by Ryan Blue of the Netflix team, where he sets Parquet configurations on the sqlContext:

sqlContext.setConf("parquet.filter.dictionary.enabled", "true")

It looks like this is not about plain dictionary encoding in Parquet files.

Upvotes: 10

Views: 15358

Answers (3)

pwilson

Reputation: 21

Adding on to Artem's answer, parquet-tools has been marked as deprecated and can no longer be installed via Homebrew. An alternative way of running this tool is to build it from an older tag of parquet-mr:

git clone https://github.com/apache/parquet-mr.git
cd parquet-mr
git checkout tags/apache-parquet-1.8.1
cd parquet-tools
mvn clean package -Plocal
java -jar target/parquet-tools-1.8.1.jar meta <your_parquet_file.snappy.parquet>

Upvotes: 0

Artem

Reputation: 1217

So I found an answer to my question on the Twitter engineering blog.

Parquet enables dictionary encoding automatically when the number of unique values in a column is < 10^5. Here is a post announcing Parquet 1.0 with self-tuning dictionary encoding.
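
To see this behavior, here is a minimal sketch (the output path and column names are placeholders of mine): write one low-cardinality column and one high-cardinality column, then inspect the file with parquet-tools meta as shown in UPD2 below:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("dict-demo").master("local[*]").getOrCreate()

val df = spark.range(0, 1000000).toDF("id")
  .withColumn("low_card", col("id") % 100)  // ~100 unique values: expect PLAIN_DICTIONARY
  .withColumn("high_card", rand())          // ~10^6 unique values: expect PLAIN

df.write.mode("overwrite").parquet("/tmp/dict_demo.parquet")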

UPD:

Dictionary encoding can be toggled in the SparkSession config:

SparkSession.builder
            .appName("name")
            .config("parquet.enable.dictionary", "false") // or "true"
            .getOrCreate()
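
A related sketch, reusing the spark and df values from the example above: the same flag can also be set on the session's Hadoop configuration, which Parquet picks up at write time (the output path is again a placeholder of mine):

// Disable dictionary encoding for subsequent writes in this session
spark.sparkContext.hadoopConfiguration.set("parquet.enable.dictionary", "false")
df.write.mode("overwrite").parquet("/tmp/no_dictionary.parquet")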

Regarding per-column encoding, there is an open improvement issue in Parquet's Jira, created on 14 July 2017. Since dictionary encoding is the default and can only be applied to the whole table, enabling it turns off delta encoding (Jira issue for this bug), which is the only suitable encoding for data like timestamps, where almost every value is unique.

UPD2

How can we tell which encoding was used for an output file?

  • I used parquet-tools for it:

    brew install parquet-tools    # on macOS
    parquet-tools meta your_parquet_file.snappy.parquet

Output:

.column_1: BINARY SNAPPY DO:0 FPO:16637 SZ:2912/8114/3.01 VC:26320 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
.column_2: BINARY SNAPPY DO:0 FPO:25526 SZ:119245/711487/1.32 VC:26900 ENC:PLAIN,RLE,BIT_PACKED
...

where PLAIN and PLAIN_DICTIONARY (listed in the ENC field) are the encodings that were used for those columns.

Upvotes: 13

Partha Mishra

Reputation: 312

I'm not sure whether I've understood the entire scope of your question (if not, please feel free to clarify).

You can specify storage options for a Hive table using CREATE TABLE src(id int) USING hive OPTIONS(fileFormat 'parquet') (reference).
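
A minimal sketch of running that statement from a Spark application, assuming a Hive-enabled SparkSession (the appName is a placeholder of mine; the SQL itself is from the reference above):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-parquet-demo")
  .enableHiveSupport() // required so the "hive" source is available
  .getOrCreate()

spark.sql("CREATE TABLE src(id int) USING hive OPTIONS(fileFormat 'parquet')")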

This one should be easier to follow and more comprehensive:

Read/write a file:

val usersDF = spark.read.load("examples/src/main/resources/users.parquet")
usersDF.select("name", "favorite_color").write.save("namesAndFavColors.parquet")

I'm also assuming you might have already checked the older API (both methods were deprecated in later Spark versions in favor of spark.read.parquet and df.write.parquet):

  • sqlContext.parquetFile("File_to_be_read.parquet")
  • myDataFrame.saveAsParquetFile("file_to_be_saved.parquet")

Upvotes: -1
