Reputation: 119
I have a Parquet dataset stored in s3://my-bucket/events/date=X/ as multiple part files:
part000.snappy.parquet
part001.snappy.parquet
part002.snappy.parquet
Events in the dataset have a timestamp column, a string in ISO 8601 format. The events are completely unsorted.
Using Spark, I would like to sort the dataset and store it back in S3, such that within each output file
partXXX.snappy.parquet
events are ordered by timestamp.
Details:
- Each part file is 200 MB - 1 GB in size.
- The final saved files can contain any number of events, as long as I can control their size somehow. I would like to keep the part files smaller than 1 GB.
Is it easy to do this in Spark? How could one implement this?
Upvotes: 0
Views: 1632
Reputation: 119
The following worked:
from math import ceil

target_path = "s3://..."

events = spark.read.parquet("s3://my-bucket/events/date=X/")
events = events.sort("timestamp", ascending=True)

# EVENTS_PER_FILE is a user-chosen cap on the number of rows per output file
num_files = ceil(float(events.count()) / EVENTS_PER_FILE)

events.coalesce(num_files).write.parquet(
    target_path,
    mode="overwrite")  # note: overwrite deletes old files
Upvotes: 2