Reputation: 2179
I have a Scala Spark job that writes to S3 as Parquet. It's 6 billion records so far and it will keep growing daily. Per the use case, our API will query the Parquet data by id, so to make reads faster I am writing the Parquet partitioned on id. However, we have 1330360 unique ids, so the write creates 1330360 partition directories, and the write step is very slow: it has been running for the past 9 hours and is still going.
output.write.mode("append").partitionBy("id").parquet("s3a://datalake/db/")
Is there any way I can reduce the number of partitions and still keep the read queries fast? Or is there a better way to handle this scenario? Thanks.
EDIT: id is an integer column with random numbers.
Upvotes: 1
Views: 1677
Reputation: 25909
You can partition by ranges of ids (you didn't say anything about the ids, so I can't suggest something specific) and/or use buckets instead of partitions: https://www.slideshare.net/TejasPatil1/hive-bucketing-in-apache-spark. A sketch of both approaches is below.
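As an illustration (not a definitive implementation), here is a minimal Scala sketch. It assumes `output` is the DataFrame from the question; the bucket count of 1000, the derived column name `id_bucket`, and the table name `datalake_db` are made-up values you would tune for your own data:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.functions.{col, lit, pmod}

// Assumption: roughly 1000 partitions instead of one per unique id.
val numBuckets = 1000

// Derive a coarse partition key from the random integer id.
val withBucket: DataFrame = output
  .withColumn("id_bucket", pmod(col("id"), lit(numBuckets)))

// Option 1: partition by the derived range/bucket column instead of the raw id.
withBucket.write
  .mode(SaveMode.Append)
  .partitionBy("id_bucket")
  .parquet("s3a://datalake/db/")

// At query time, compute the same bucket for the requested id and include it
// in the filter so Spark prunes all other partition directories:
//   spark.read.parquet("s3a://datalake/db/")
//     .where(col("id_bucket") === requestedId % numBuckets && col("id") === requestedId)

// Option 2: Spark's built-in bucketing, which writes a fixed number of files
// per partition. Note that bucketBy requires saving through a table, not a raw path.
output.write
  .mode(SaveMode.Append)
  .bucketBy(numBuckets, "id")
  .sortBy("id")
  .saveAsTable("datalake_db")
```

With either option the number of output files is bounded by the bucket count rather than by the number of distinct ids, which is what makes the write finish in reasonable time.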
Upvotes: 1