Kodo

Reputation: 551

Specify options while saving Spark DataFrame as Parquet

I use MongoSpark to read JSON data from a MongoDB database into a Spark DataFrame. Now I want to write the data in that DataFrame out as Parquet files, and that part works like a charm. However, I'm struggling to set compression-related options for the generated Parquet files: I'd like to use Snappy as the codec, and I'd also like to produce "larger" files by specifying the block size. I've lost count of how many approaches I've tried so far. I thought this would be straightforward, just a matter of chaining some .option(...) calls onto the DataFrame.write() method, but so far I've been unsuccessful.
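
In essence, what I've been attempting looks roughly like the following sketch. The option keys and values here are only examples of the variants I've tried, and the output path is a placeholder:

    import com.mongodb.spark.MongoSpark
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()
    val df = MongoSpark.load(spark)  // assumes spark.mongodb.input.uri is configured

    // Chain the desired Parquet options onto the writer.
    df.write
      .option("compression", "snappy")                   // the codec I'd like to use
      .option("parquet.block.size", 512L * 1024 * 1024)  // one of the block-size keys I've tried
      .parquet("/path/to/output")                        // placeholder path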

What am I doing wrong here?

Upvotes: 1

Views: 5807

Answers (1)

Assaf Mendelson

Reputation: 13001

You have two options:

  1. Set the spark.sql.parquet.compression.codec configuration in Spark to snappy. This has to be done before the Spark session is created, either in the config you build the session from or by changing the default configuration file (see the sketch after this list).
  2. df.write.option("compression", "snappy").parquet(filename)
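
A minimal sketch of the first option, assuming the session is built in code (the builder chain and the example write are illustrative; the config key/value pair is the part that matters):

    import org.apache.spark.sql.SparkSession

    // Set the codec at session level; this must happen before the session is created.
    val spark = SparkSession.builder()
      .config("spark.sql.parquet.compression.codec", "snappy")
      .getOrCreate()

    // Any subsequent Parquet write from this session uses Snappy by default.
    spark.range(10).write.parquet("/tmp/snappy-example")  // placeholder data and path

The second option sets the codec on a single write and takes precedence over the session-level setting if both are present.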

Upvotes: 2
