kf2

Reputation: 185

Best practice for merging files after a Spark batch job

We have multiple Spark jobs that connect to different data sources (Kafka, Oracle, MySQL, ...) and offload/import data via a Spark batch.

Each job reads its source, adds a couple of extra fields, and then appends the result to a partitioned (YYYY-MM-DD) Hive Parquet table (df...saveAsTable(....)). The jobs run every 5 minutes. Everything has been running pretty smoothly so far.
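In simplified form, each job does roughly the following (just a sketch; the connector options, column names, and table names are placeholders, not our real ones):

import org.apache.spark.sql.functions._

// read one source as a batch (here a JDBC source; Kafka/Oracle/MySQL in our case)
val source = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://host/db")      // placeholder connection
  .option("dbtable", "source_table")          // placeholder source table
  .load()

// add a couple of extra columns and append to the daily-partitioned Hive Parquet table
source
  .withColumn("load_date", current_date())    // the YYYY-MM-DD partition column
  .write
  .mode("append")
  .format("parquet")
  .partitionBy("load_date")
  .saveAsTable("db.target_table")             // placeholder table name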

"Problem" is now that we found out that it is a big performance increase if we merge the small files inside of the daily partitions.

For now we just use "insert overwrite table" to overwrite each partition with its own data; through that process the data gets merged into bigger files. But the process is manual and doesn't really feel like best practice.
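Roughly, the manual compaction is something like this (table name, partition column, and column list are placeholders):

// rewrite one daily partition with its own contents so it ends up in fewer, larger files
spark.sql("""
  INSERT OVERWRITE TABLE db.target_table PARTITION (load_date = '2018-01-01')
  SELECT col1, col2, col3
  FROM db.target_table
  WHERE load_date = '2018-01-01'
""")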

How do you deal with that? It must be a very common issue.

Thanks in advance.

Upvotes: 0

Views: 2763

Answers (1)

Gourav Dutta

Reputation: 553

If I understand correctly, you generate the Parquet files with partitions as below.

/user/hive/warehouse/table/date=2015-01-01/nameloadfile1/file.parq
/user/hive/warehouse/table/date=2015-01-01/nameloadfile1/file2.parq
/user/hive/warehouse/table/date=2015-01-01/nameloadfile1/file3.parq
/user/hive/warehouse/table/date=2015-01-01/nameloadfile1/file4.parq

Now, you merge those files manually. Instead of that, you could do the following, which happens automatically:

df.coalesce(1).write.mode(SaveMode.Overwrite).partitionBy(<partitioncolumn>).parquet(<HIVEtbl>)
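If coalescing everything into a single task becomes a bottleneck, a variant of the same idea is to repartition by the partition column instead of coalesce(1); each partition value then still ends up in one file per run, but the write keeps some parallelism across partitions (a rough sketch with the same placeholders as above; col comes from org.apache.spark.sql.functions):

df.repartition(col(<partitioncolumn>))
  .write
  .mode(SaveMode.Overwrite)
  .partitionBy(<partitioncolumn>)
  .parquet(<HIVEtbl>)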

The following properties could also come in handy. Take a look at these:

spark.sql("SET hive.merge.sparkfiles = true")  
spark.sql("SET hive.merge.mapredfiles = true")
spark.sql("SET hive.merge.mapfiles = true")
spark.sql("set hive.merge.smallfiles.avgsize = 128000000")
spark.sql("set hive.merge.size.per.task = 128000000")

Hope this helps.

Upvotes: 0
