Reputation: 1291
I have a large table in HDFS with millions of rows per day. My goal is to write files of roughly the same size by forcing a maximum number of rows per file. This is what I am doing to achieve that (Spark 2.3.1). For example
spark.sql("select * from dataToWrite where yyyy_mm_dd = '2019-10-01'")
.write
.option("maxRecordsPerFile", 1000000)
.format("orc")
.partitionBy("yyyy_mm_dd")
.mode(SaveMode.Overwrite)
.save(s"hdfs://nameservice1/user/hive/warehouse/aliparoti.db/test_data'")
After that I check HDFS and get
-rw-rw-rw- 3 aliparoti hive 33.8 M 2019-10-17 15:20 /user/hive/warehouse/aliparoti.db/test_data/yyyy_mm_dd=2019-10-01/part-00097-9fb1ec99-c7c9-4d41-a183-ca58d18bbe8f.c001.snappy.orc
-rw-rw-rw- 3 aliparoti hive 596.0 M 2019-10-17 15:23 /user/hive/warehouse/aliparoti.db/test_data/yyyy_mm_dd=2019-10-01/part-00098-9fb1ec99-c7c9-4d41-a183-ca58d18bbe8f.c000.snappy.orc
-rw-rw-rw- 3 aliparoti hive 594.2 M 2019-10-17 15:29 /user/hive/warehouse/aliparoti.db/test_data/yyyy_mm_dd=2019-10-01/part-00098-9fb1ec99-c7c9-4d41-a183-ca58d18bbe8f.c001.snappy.orc
-rw-rw-rw- 3 aliparoti hive 589.4 M 2019-10-17 15:37 /user/hive/warehouse/aliparoti.db/test_data/yyyy_mm_dd=2019-10-01/part-00098-9fb1ec99-c7c9-4d41-a183-ca58d18bbe8f.c002.snappy.orc
-rw-rw-rw- 3 aliparoti hive 101.6 M 2019-10-17 15:38 /user/hive/warehouse/aliparoti.db/test_data/yyyy_mm_dd=2019-10-01/part-00098-9fb1ec99-c7c9-4d41-a183-ca58d18bbe8f.c003.snappy.orc
-rw-rw-rw- 3 aliparoti hive 592.1 M 2019-10-17 15:16 /user/hive/warehouse/aliparoti.db/test_data/yyyy_mm_dd=2019-10-01/part-00099-9fb1ec99-c7c9-4d41-a183-ca58d18bbe8f.c000.snappy.orc
-rw-rw-rw- 3 aliparoti hive 33.8 M 2019-10-17 15:17 /user/hive/warehouse/aliparoti.db/test_data/yyyy_mm_dd=2019-10-01/part-00099-9fb1ec99-c7c9-4d41-a183-ca58d18bbe8f.c001.snappy.orc
-rw-rw-rw- 3 aliparoti hive 595.8 M 2019-10-17 15:19 /user/hive/warehouse/aliparoti.db/test_data/yyyy_mm_dd=2019-10-01/part-00100-9fb1ec99-c7c9-4d41-a183-ca58d18bbe8f.c000.snappy.orc
-rw-rw-rw- 3 aliparoti hive 33.6 M 2019-10-17 15:20 /user/hive/warehouse/aliparoti.db/test_data/yyyy_mm_dd=2019-10-01/part-00100-9fb1ec99-c7c9-4d41-a183-ca58d18bbe8f.c001.snappy.orc
-rw-rw-rw- 3 aliparoti hive 595.2 M 2019-10-17 15:19 /user/hive/warehouse/aliparoti.db/test_data/yyyy_mm_dd=2019-10-01/part-00101-9fb1ec99-c7c9-4d41-a183-ca58d18bbe8f.c000.snappy.orc
-rw-rw-rw- 3 aliparoti hive 32.7 M 2019-10-17 15:20 /user/hive/warehouse/aliparoti.db/test_data/yyyy_mm_dd=2019-10-01/part-00101-9fb1ec99-c7c9-4d41-a183-ca58d18bbe8f.c001.snappy.orc
-rw-rw-rw- 3 aliparoti hive 595.1 M 2019-10-17 15:19 /user/hive/warehouse/aliparoti.db/test_data/yyyy_mm_dd=2019-10-01/part-00102-9fb1ec99-c7c9-4d41-a183-ca58d18bbe8f.c000.snappy.orc
-rw-rw-rw- 3 aliparoti hive 33.9 M 2019-10-17 15:20 /user/hive/warehouse/aliparoti.db/test_data/yyyy_mm_dd=2019-10-01/part-00102-9fb1ec99-c7c9-4d41-a183-ca58d18bbe8f.c001.snappy.orc
-rw-rw-rw- 3 aliparoti hive 594.5 M 2019-10-17 15:19 /user/hive/warehouse/aliparoti.db/test_data/yyyy_mm_dd=2019-10-01/part-00103-9fb1ec99-c7c9-4d41-a183-ca58d18bbe8f.c000.snappy.orc
-rw-rw-rw- 3 aliparoti hive 34.6 M 2019-10-17 15:20 /user/hive/warehouse/aliparoti.db/test_data/yyyy_mm_dd=2019-10-01/part-00103-9fb1ec99-c7c9-4d41-a183-ca58d18bbe8f.c001.snappy.orc
-rw-rw-rw- 3 aliparoti hive 592.0 M 2019-10-17 15:23 /user/hive/warehouse/aliparoti.db/test_data/yyyy_mm_dd=2019-10-01/part-00104-9fb1ec99-c7c9-4d41-a183-ca58d18bbe8f.c000.snappy.orc
-rw-rw-rw- 3 aliparoti hive 592.2 M 2019-10-17 15:28 /user/hive/warehouse/aliparoti.db/test_data/yyyy_mm_dd=2019-10-01/part-00104-9fb1ec99-c7c9-4d41-a183-ca58d18bbe8f.c001.snappy.orc
-rw-rw-rw- 3 aliparoti hive 67.6 M 2019-10-17 15:29 /user/hive/warehouse/aliparoti.db/test_data/yyyy_mm_dd=2019-10-01/part-00104-9fb1ec99-c7c9-4d41-a183-ca58d18bbe8f.c002.snappy.orc
-rw-rw-rw- 3 aliparoti hive 595.4 M 2019-10-17 15:24 /user/hive/warehouse/aliparoti.db/test_data/yyyy_mm_dd=2019-10-01/part-00105-9fb1ec99-c7c9-4d41-a183-ca58d18bbe8f.c000.snappy.orc
-rw-rw-rw- 3 aliparoti hive 33.1 M 2019-10-17 15:24 /user/hive/warehouse/aliparoti.db/test_data/yyyy_mm_dd=2019-10-01/part-00105-9fb1ec99-c7c9-4d41-a183-ca58d18bbe8f.c001.snappy.orc
-rw-rw-rw- 3 aliparoti hive 597.2 M 2019-10-17 15:20 /user/hive/warehouse/aliparoti.db/test_data/yyyy_mm_dd=2019-10-01/part-00106-9fb1ec99-c7c9-4d41-a183-ca58d18bbe8f.c000.snappy.orc
-rw-rw-rw- 3 aliparoti hive 35.0 M 2019-10-17 15:20 /user/hive/warehouse/aliparoti.db/test_data/yyyy_mm_dd=2019-10-01/part-00106-9fb1ec99-c7c9-4d41-a183-ca58d18bbe8f.c001.snappy.orc
Somehow I get these uneven file sizes. Do you know what I can do to get chunks of the same size? Do I maybe need a custom partitioner?
Upvotes: 3
Views: 1235
Reputation: 61
This is inefficient, as only one core will be doing the work, but it should give you the expected result:
spark.sql("select * from dataToWrite where yyyy_mm_dd = '2019-10-01'")
.repartition("yyyy_mm_dd")
.write
.option("maxRecordsPerFile", 1000000)
.format("orc")
.partitionBy("yyyy_mm_dd")
.mode(SaveMode.Overwrite)
.save(s"hdfs://nameservice1/user/hive/warehouse/aliparoti.db/test_data'")
Upvotes: 1
Reputation: 314
The reason for this behavior is:
Assume that you have 10,005,000 rows of data evenly distributed across 10 partitions, so each partition holds 1,000,500 rows. Since each executor core works on one partition, during the write each task produces one file with 1,000,000 rows and another with only 500 rows. Ideally you therefore end up with 10 equally sized large files and 10 equally sized small files. That is what is happening in your case.
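To make the arithmetic concrete, here is a tiny sketch using those same assumed numbers (they are illustrative, not taken from your data):
val totalRows         = 10005000L
val numPartitions     = 10
val maxRecordsPerFile = 1000000L

val rowsPerPartition = totalRows / numPartitions                                      // 1,000,500
val filesPerTask     = math.ceil(rowsPerPartition.toDouble / maxRecordsPerFile).toInt // 2
val rowsInLastFile   = rowsPerPartition % maxRecordsPerFile                           // 500

println(s"each write task produces $filesPerTask files, the last with only $rowsInLastFile rows")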
Solution:
You need to perform a count first, and based on that value work out a number of partitions such that each one holds roughly 1,000,000 rows. Pass this number to repartition and you will end up with equally sized files; see the sketch below.
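A minimal sketch of that approach, assuming a target of roughly 1,000,000 rows per file (the DataFrame name df and the target constant are placeholders, not from the original post):
import org.apache.spark.sql.SaveMode

// Sketch: derive the partition count from an explicit count so that each
// output file ends up close to the per-file row target.
val df = spark.sql("select * from dataToWrite where yyyy_mm_dd = '2019-10-01'")

val targetRowsPerFile = 1000000L
val totalRows         = df.count()  // extra pass over the data
val numFiles          = math.max(1, math.ceil(totalRows.toDouble / targetRowsPerFile).toInt)

df.repartition(numFiles)
  .write
  .option("maxRecordsPerFile", targetRowsPerFile)
  .format("orc")
  .partitionBy("yyyy_mm_dd")
  .mode(SaveMode.Overwrite)
  .save("hdfs://nameservice1/user/hive/warehouse/aliparoti.db/test_data")
The count costs one extra scan, but it means every task holds close to the per-file target, so maxRecordsPerFile no longer has to split oversized partitions into one large file and one small remainder file.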
Hope this helps!
Upvotes: 1