Reputation: 185
We have multiple Spark jobs running which connect to different data sources (Kafka, Oracle, MySQL, ...) and offload/import data via a Spark batch.
Each job reads the source, adds a couple of derived columns and then appends the result to a partitioned (YYYY-MM-DD) Hive parquet table (df...saveAsTable(...)). The jobs run every 5 minutes. Everything has been running pretty smoothly so far.
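For context, a simplified sketch of what each job does (the source table, column and table names are only placeholders, not our real ones):

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions._

// Illustrative only: read a source, add a couple of derived columns
// and append to a Hive parquet table partitioned by day.
val sourceDf = spark.table("staging.raw_events")   // placeholder for the actual source read

val enriched = sourceDf
  .withColumn("load_ts", current_timestamp())
  .withColumn("dt", date_format(current_date(), "yyyy-MM-dd"))

enriched.write
  .mode(SaveMode.Append)
  .format("parquet")
  .partitionBy("dt")
  .saveAsTable("db.events")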
"Problem" is now that we found out that it is a big performance increase if we merge the small files inside of the daily partitions.
For now we just use "insert overwrite table" to overwrite the partition with the same data, through that process the data is merged into bigger files. But the process is manually and feels not really like "BestPractice".
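Roughly what we run by hand today to compact a day (table, column names and partition value are just examples):

// Illustrative only: rewrite one daily partition onto itself so the
// many small files get rewritten into fewer, larger ones.
spark.sql("""
  INSERT OVERWRITE TABLE db.events PARTITION (dt = '2015-01-01')
  SELECT col1, col2, load_ts   -- every column except the partition column
  FROM db.events
  WHERE dt = '2015-01-01'
""")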
How do you guys deal with that? It must be a very common issue.
Thanks in advance.
Upvotes: 0
Views: 2763
Reputation: 553
If I am understanding it correctly, you generate the parquet files with partitions as below.
/user/hive/warehouse/table/date=2015-01-01/nameloadfile1/file.parq
/user/hive/warehouse/table/date=2015-01-01/nameloadfile1/file2.parq
/user/hive/warehouse/table/date=2015-01-01/nameloadfile1/file3.parq
/user/hive/warehouse/table/date=2015-01-01/nameloadfile1/file4.parq
Right now you merge those files manually. Instead, you could do something like the following, which merges them automatically as part of the write.
df.coalesce(1).write.mode(SaveMode.Overwrite).partitionBy(<partitioncolumn>).parquet(<HIVEtbl>)
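Spelled out a bit more (the path and partition column below are only examples; pick a coalesce factor that fits your data volume):

import org.apache.spark.sql.SaveMode

// Illustrative sketch: coalesce(1) makes each write produce a single output
// file per partition directory; use a higher number if one file per
// partition becomes too large.
df.coalesce(1)
  .write
  .mode(SaveMode.Overwrite)
  .partitionBy("date")
  .parquet("/user/hive/warehouse/table")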
The following properties could also come in handy. Take a look at these.
spark.sql("SET hive.merge.sparkfiles = true")
spark.sql("SET hive.merge.mapredfiles = true")
spark.sql("SET hive.merge.mapfiles = true")
spark.sql("set hive.merge.smallfiles.avgsize = 128000000")
spark.sql("set hive.merge.size.per.task = 128000000")
Hope this helps.
Upvotes: 0