Reputation: 119
I have a Parquet dataset stored in s3://my-bucket/events/date=X/ as multiple part files:
part000.snappy.parquet
part001.snappy.parquet
part002.snappy.parquet
Events in the dataset have a timestamp column, a string in ISO 8601 format. The events are completely unsorted.
Using Spark, I would like to sort the dataset and store it back in S3, such that within each output file
partXXX.snappy.parquet
events are ordered by timestamp.
Details:
- Each part file is 200 MB - 1 GB in size.
- The final saved files can contain any number of events, as long as I can control their size somehow. I would like to keep the part files smaller than 1 GB.
Is it easy to do this in Spark? How could one implement this?
Upvotes: 0
Views: 1632
Reputation: 119
The following worked:
from math import ceil

target_path = "s3://..."

events = spark.read.parquet("s3://my-bucket/events/date=X/")
events = events.sort("timestamp", ascending=True)

# EVENTS_PER_FILE is a user-chosen cap on the number of rows per output file
num_files = ceil(float(events.count()) / EVENTS_PER_FILE)

events.coalesce(num_files).write.parquet(
    target_path,
    mode="overwrite")  # note: overwrite deletes old files
Upvotes: 2