Reputation: 1291
I have a large table in HDFS with millions of rows per day. My goal is to write files of roughly the same size by forcing a maximum number of rows per file. This is what I am doing to achieve that (Spark 2.3.1). For example
spark.sql("select * from dataToWrite where yyyy_mm_dd = '2019-10-01'")
.write
.option("maxRecordsPerFile", 1000000)
.format("orc")
.partitionBy("yyyy_mm_dd")
.mode(SaveMode.Overwrite)
.save(s"hdfs://nameservice1/user/hive/warehouse/aliparoti.db/test_data'")
After that I check HDFS and get
-rw-rw-rw- 3 aliparoti hive 33.8 M 2019-10-17 15:20 /user/hive/warehouse/aliparoti.db/test_data/yyyy_mm_dd=2019-10-01/part-00097-9fb1ec99-c7c9-4d41-a183-ca58d18bbe8f.c001.snappy.orc
-rw-rw-rw- 3 aliparoti hive 596.0 M 2019-10-17 15:23 /user/hive/warehouse/aliparoti.db/test_data/yyyy_mm_dd=2019-10-01/part-00098-9fb1ec99-c7c9-4d41-a183-ca58d18bbe8f.c000.snappy.orc
-rw-rw-rw- 3 aliparoti hive 594.2 M 2019-10-17 15:29 /user/hive/warehouse/aliparoti.db/test_data/yyyy_mm_dd=2019-10-01/part-00098-9fb1ec99-c7c9-4d41-a183-ca58d18bbe8f.c001.snappy.orc
-rw-rw-rw- 3 aliparoti hive 589.4 M 2019-10-17 15:37 /user/hive/warehouse/aliparoti.db/test_data/yyyy_mm_dd=2019-10-01/part-00098-9fb1ec99-c7c9-4d41-a183-ca58d18bbe8f.c002.snappy.orc
-rw-rw-rw- 3 aliparoti hive 101.6 M 2019-10-17 15:38 /user/hive/warehouse/aliparoti.db/test_data/yyyy_mm_dd=2019-10-01/part-00098-9fb1ec99-c7c9-4d41-a183-ca58d18bbe8f.c003.snappy.orc
-rw-rw-rw- 3 aliparoti hive 592.1 M 2019-10-17 15:16 /user/hive/warehouse/aliparoti.db/test_data/yyyy_mm_dd=2019-10-01/part-00099-9fb1ec99-c7c9-4d41-a183-ca58d18bbe8f.c000.snappy.orc
-rw-rw-rw- 3 aliparoti hive 33.8 M 2019-10-17 15:17 /user/hive/warehouse/aliparoti.db/test_data/yyyy_mm_dd=2019-10-01/part-00099-9fb1ec99-c7c9-4d41-a183-ca58d18bbe8f.c001.snappy.orc
-rw-rw-rw- 3 aliparoti hive 595.8 M 2019-10-17 15:19 /user/hive/warehouse/aliparoti.db/test_data/yyyy_mm_dd=2019-10-01/part-00100-9fb1ec99-c7c9-4d41-a183-ca58d18bbe8f.c000.snappy.orc
-rw-rw-rw- 3 aliparoti hive 33.6 M 2019-10-17 15:20 /user/hive/warehouse/aliparoti.db/test_data/yyyy_mm_dd=2019-10-01/part-00100-9fb1ec99-c7c9-4d41-a183-ca58d18bbe8f.c001.snappy.orc
-rw-rw-rw- 3 aliparoti hive 595.2 M 2019-10-17 15:19 /user/hive/warehouse/aliparoti.db/test_data/yyyy_mm_dd=2019-10-01/part-00101-9fb1ec99-c7c9-4d41-a183-ca58d18bbe8f.c000.snappy.orc
-rw-rw-rw- 3 aliparoti hive 32.7 M 2019-10-17 15:20 /user/hive/warehouse/aliparoti.db/test_data/yyyy_mm_dd=2019-10-01/part-00101-9fb1ec99-c7c9-4d41-a183-ca58d18bbe8f.c001.snappy.orc
-rw-rw-rw- 3 aliparoti hive 595.1 M 2019-10-17 15:19 /user/hive/warehouse/aliparoti.db/test_data/yyyy_mm_dd=2019-10-01/part-00102-9fb1ec99-c7c9-4d41-a183-ca58d18bbe8f.c000.snappy.orc
-rw-rw-rw- 3 aliparoti hive 33.9 M 2019-10-17 15:20 /user/hive/warehouse/aliparoti.db/test_data/yyyy_mm_dd=2019-10-01/part-00102-9fb1ec99-c7c9-4d41-a183-ca58d18bbe8f.c001.snappy.orc
-rw-rw-rw- 3 aliparoti hive 594.5 M 2019-10-17 15:19 /user/hive/warehouse/aliparoti.db/test_data/yyyy_mm_dd=2019-10-01/part-00103-9fb1ec99-c7c9-4d41-a183-ca58d18bbe8f.c000.snappy.orc
-rw-rw-rw- 3 aliparoti hive 34.6 M 2019-10-17 15:20 /user/hive/warehouse/aliparoti.db/test_data/yyyy_mm_dd=2019-10-01/part-00103-9fb1ec99-c7c9-4d41-a183-ca58d18bbe8f.c001.snappy.orc
-rw-rw-rw- 3 aliparoti hive 592.0 M 2019-10-17 15:23 /user/hive/warehouse/aliparoti.db/test_data/yyyy_mm_dd=2019-10-01/part-00104-9fb1ec99-c7c9-4d41-a183-ca58d18bbe8f.c000.snappy.orc
-rw-rw-rw- 3 aliparoti hive 592.2 M 2019-10-17 15:28 /user/hive/warehouse/aliparoti.db/test_data/yyyy_mm_dd=2019-10-01/part-00104-9fb1ec99-c7c9-4d41-a183-ca58d18bbe8f.c001.snappy.orc
-rw-rw-rw- 3 aliparoti hive 67.6 M 2019-10-17 15:29 /user/hive/warehouse/aliparoti.db/test_data/yyyy_mm_dd=2019-10-01/part-00104-9fb1ec99-c7c9-4d41-a183-ca58d18bbe8f.c002.snappy.orc
-rw-rw-rw- 3 aliparoti hive 595.4 M 2019-10-17 15:24 /user/hive/warehouse/aliparoti.db/test_data/yyyy_mm_dd=2019-10-01/part-00105-9fb1ec99-c7c9-4d41-a183-ca58d18bbe8f.c000.snappy.orc
-rw-rw-rw- 3 aliparoti hive 33.1 M 2019-10-17 15:24 /user/hive/warehouse/aliparoti.db/test_data/yyyy_mm_dd=2019-10-01/part-00105-9fb1ec99-c7c9-4d41-a183-ca58d18bbe8f.c001.snappy.orc
-rw-rw-rw- 3 aliparoti hive 597.2 M 2019-10-17 15:20 /user/hive/warehouse/aliparoti.db/test_data/yyyy_mm_dd=2019-10-01/part-00106-9fb1ec99-c7c9-4d41-a183-ca58d18bbe8f.c000.snappy.orc
-rw-rw-rw- 3 aliparoti hive 35.0 M 2019-10-17 15:20 /user/hive/warehouse/aliparoti.db/test_data/yyyy_mm_dd=2019-10-01/part-00106-9fb1ec99-c7c9-4d41-a183-ca58d18bbe8f.c001.snappy.orc
Somehow I get these uneven file sizes. Do you know what I can do to get chunks of the same size? Do I maybe need a custom partitioner?
Upvotes: 3
Views: 1235
Reputation: 61
This is inefficient, as only one core will be doing the work, but it should give you the expected result:
spark.sql("select * from dataToWrite where yyyy_mm_dd = '2019-10-01'")
.repartition("yyyy_mm_dd")
.write
.option("maxRecordsPerFile", 1000000)
.format("orc")
.partitionBy("yyyy_mm_dd")
.mode(SaveMode.Overwrite)
.save(s"hdfs://nameservice1/user/hive/warehouse/aliparoti.db/test_data'")
Upvotes: 1
Reputation: 314
The reason for this behavior is:
Assume that you have 10,005,000 rows of data evenly distributed across 10 partitions, so each partition holds 1,000,500 rows. Since each executor core works on one partition, during the write each task produces one file with 1,000,000 rows and another with only 500 rows. Ideally you therefore end up with 10 equally sized large files and 10 equally sized small files. That is what is happening in your case.
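To make the arithmetic concrete, here is a tiny sketch using those same assumed numbers (they are illustrative, not taken from your data):
val totalRows         = 10005000L
val numPartitions     = 10
val maxRecordsPerFile = 1000000L

val rowsPerPartition = totalRows / numPartitions                                      // 1,000,500
val filesPerTask     = math.ceil(rowsPerPartition.toDouble / maxRecordsPerFile).toInt // 2
val rowsInLastFile   = rowsPerPartition % maxRecordsPerFile                           // 500

println(s"each write task produces $filesPerTask files, the last with only $rowsInLastFile rows")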
Solution:
You need to perform a count first, and based on that value work out a number of partitions such that each one holds roughly 1,000,000 rows. Pass this number to repartition and you will end up with equally sized files; see the sketch below.
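A minimal sketch of that approach, assuming a target of roughly 1,000,000 rows per file (the DataFrame name df and the target constant are placeholders, not from the original post):
import org.apache.spark.sql.SaveMode

// Sketch: derive the partition count from an explicit count so that each
// output file ends up close to the per-file row target.
val df = spark.sql("select * from dataToWrite where yyyy_mm_dd = '2019-10-01'")

val targetRowsPerFile = 1000000L
val totalRows         = df.count()  // extra pass over the data
val numFiles          = math.max(1, math.ceil(totalRows.toDouble / targetRowsPerFile).toInt)

df.repartition(numFiles)
  .write
  .option("maxRecordsPerFile", targetRowsPerFile)
  .format("orc")
  .partitionBy("yyyy_mm_dd")
  .mode(SaveMode.Overwrite)
  .save("hdfs://nameservice1/user/hive/warehouse/aliparoti.db/test_data")
The count costs one extra scan, but it means every task holds close to the per-file target, so maxRecordsPerFile no longer has to split oversized partitions into one large file and one small remainder file.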
Hope this helps!
Upvotes: 1