Quirion

Reputation: 119

Sorting a parquet dataset using Spark and storing the sorted result as multiple files in S3

In s3://my-bucket/events/date=X/ I have a Parquet dataset stored in multiple part files.

Events in the dataset have a timestamp column containing ISO 8601 strings. The events are completely unsorted.

Using Spark, I would like to sort the dataset by timestamp and store it back in S3.

Details:
- Each part file is 200 MB - 1 GB.
- The final saved files can contain any number of events, as long as I can control their size somehow. I would like to keep the part files smaller than 1 GB.

Is it easy to do this in Spark? How could one implement this?

Upvotes: 0

Views: 1632

Answers (1)

Quirion

Reputation: 119

The following worked:

from math import ceil

EVENTS_PER_FILE = 1000000  # example value; pick it so each output file stays under ~1 GB
target_path = "s3://..."

# Read the unsorted dataset and sort it globally by timestamp.
events = spark.read.parquet("s3://my-bucket/events/date=X/")
events = events.sort("timestamp", ascending=True)

# Coalesce so each output file holds roughly EVENTS_PER_FILE events.
num_files = ceil(float(events.count()) / EVENTS_PER_FILE)
events.coalesce(num_files).write.parquet(
    target_path,
    mode="overwrite")  # note: overwrite deletes old files

Upvotes: 2
