Reputation: 551
I use MongoSpark to read JSON data from a MongoDB database into a Spark DataFrame. Writing the JSON data in that DataFrame out as Parquet files works like a charm. However, I'm struggling to set compression-related options for the generated Parquet files: I'd like to use Snappy as the codec, and I'd also like to produce "larger" files by specifying the block size. I've lost count of how many different approaches I've tested so far. I thought this would be a straightforward matter of chaining some .option(...) calls onto the DataFrame.write() method, but so far I've been unsuccessful.
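Roughly, one of the variants I've tried looks like the sketch below. The URI, output path, and block-size value are placeholders rather than my actual setup, and whether parquet.block.size is even the right option name here is part of what I'm unsure about:

```scala
import com.mongodb.spark.MongoSpark
import org.apache.spark.sql.SparkSession

// Placeholder connection URI; my real configuration differs.
val spark = SparkSession.builder()
  .appName("mongo-to-parquet")
  .config("spark.mongodb.input.uri", "mongodb://localhost/mydb.mycollection")
  .getOrCreate()

val df = MongoSpark.load(spark)  // read the collection as a DataFrame

// The part I'm unsure about: chaining codec/block-size options on the writer.
df.write
  .option("compression", "snappy")            // desired Parquet codec
  .option("parquet.block.size", "536870912")  // desired block size in bytes (512 MB)
  .parquet("/output/path/collection.parquet")
```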
What am I doing wrong here?
Upvotes: 1
Views: 5807
Reputation: 13001
You have two options:

1. Set the spark.sql.parquet.compression.codec configuration in Spark to snappy. This would be done before creating the Spark session (either when you create the config or by changing the default configuration file).

2. df.write.option("compression", "snappy").parquet(filename)
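A minimal sketch of both options, assuming a Scala job (the app name, paths, and the tiny stand-in DataFrame are just illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Option 1: set the codec globally when the session is created.
val spark = SparkSession.builder()
  .appName("parquet-snappy-example")                        // placeholder name
  .config("spark.sql.parquet.compression.codec", "snappy")
  .getOrCreate()

val df = spark.range(10).toDF("id")   // stand-in for the real DataFrame

// Every Parquet write in this session now uses Snappy by default.
df.write.mode("overwrite").parquet("/tmp/out_session_codec")

// Option 2: set the codec per write; this takes precedence over the session default.
df.write.mode("overwrite")
  .option("compression", "snappy")
  .parquet("/tmp/out_write_option")
```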
Upvotes: 2