rclakmal

Reputation: 1982

How to save a spark RDD in gzip format through pyspark

I'm saving a Spark RDD to an S3 bucket using the following code. Is there a way to compress the output (in gz format) when saving, instead of writing it as plain text files?

help_data.repartition(5).saveAsTextFile("s3://help-test/logs/help")

Upvotes: 9

Views: 6838

Answers (1)

zero323

Reputation: 330163

The saveAsTextFile method takes an optional argument, compressionCodecClass, which specifies the compression codec class to use when writing the output:

help_data.repartition(5).saveAsTextFile(
    path="s3://help-test/logs/help",
    compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec"
)
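
To check the result outside Spark, the compressed part files can be read back with Python's standard gzip module. This is only a minimal sketch: it assumes the output directory has been copied to the local filesystem and that the files follow the usual part-NNNNN.gz naming of Hadoop text output. The helper name read_gzip_parts is hypothetical and not part of PySpark.

```python
import glob
import gzip
import os


def read_gzip_parts(output_dir):
    """Return every line from the part-*.gz files in output_dir.

    Hypothetical helper: assumes the saveAsTextFile output was copied
    to a local directory and uses part-NNNNN.gz file names.
    """
    lines = []
    # Sort the paths so the parts are read in partition order.
    for part in sorted(glob.glob(os.path.join(output_dir, "part-*.gz"))):
        # "rt" decompresses and decodes each part as text.
        with gzip.open(part, "rt", encoding="utf-8") as fh:
            lines.extend(line.rstrip("\n") for line in fh)
    return lines
```

With repartition(5) as above, one would expect up to five such part files, each an independent gzip stream. Within Spark itself, sc.textFile on the same path should decompress .gz files transparently.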

Upvotes: 15
