Reputation: 185
We have multiple Spark jobs running which connect to different data sources (Kafka, Oracle, MySQL, ...) and offload/import data via a Spark batch.
Each job reads the source, adds a couple of derived columns and then appends the result to a partitioned (YYYY-MM-DD) Hive parquet table (df...saveAsTable(...)). The jobs run every 5 minutes. Everything has been running pretty smoothly so far.
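For context, a simplified sketch of what each job does (the source table, column and table names are only placeholders, not our real ones):

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions._

// Illustrative only: read a source, add a couple of derived columns
// and append to a Hive parquet table partitioned by day.
val sourceDf = spark.table("staging.raw_events")   // placeholder for the actual source read

val enriched = sourceDf
  .withColumn("load_ts", current_timestamp())
  .withColumn("dt", date_format(current_date(), "yyyy-MM-dd"))

enriched.write
  .mode(SaveMode.Append)
  .format("parquet")
  .partitionBy("dt")
  .saveAsTable("db.events")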
"Problem" is now that we found out that it is a big performance increase if we merge the small files inside of the daily partitions.
For now we just use "insert overwrite table" to overwrite the partition with the same data, through that process the data is merged into bigger files. But the process is manually and feels not really like "BestPractice".
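Roughly what we run by hand today to compact a day (table, column names and partition value are just examples):

// Illustrative only: rewrite one daily partition onto itself so the
// many small files get rewritten into fewer, larger ones.
spark.sql("""
  INSERT OVERWRITE TABLE db.events PARTITION (dt = '2015-01-01')
  SELECT col1, col2, load_ts   -- every column except the partition column
  FROM db.events
  WHERE dt = '2015-01-01'
""")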
How do you guys deal with that? It must be a very common issue.
Thanks in advance.
Upvotes: 0
Views: 2763
Reputation: 553
If I am understanding it correctly, you generate the parquet files with partitions as below.
/user/hive/warehouse/table/date=2015-01-01/nameloadfile1/file.parq
/user/hive/warehouse/table/date=2015-01-01/nameloadfile1/file2.parq
/user/hive/warehouse/table/date=2015-01-01/nameloadfile1/file3.parq
/user/hive/warehouse/table/date=2015-01-01/nameloadfile1/file4.parq
Right now you merge those files manually. Instead, you could do something like the following, which merges them automatically as part of the write.
df.coalesce(1).write.mode(SaveMode.Overwrite).partitionBy(<partitioncolumn>).parquet(<HIVEtbl>)
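Spelled out a bit more (the path and partition column below are only examples; pick a coalesce factor that fits your data volume):

import org.apache.spark.sql.SaveMode

// Illustrative sketch: coalesce(1) makes each write produce a single output
// file per partition directory; use a higher number if one file per
// partition becomes too large.
df.coalesce(1)
  .write
  .mode(SaveMode.Overwrite)
  .partitionBy("date")
  .parquet("/user/hive/warehouse/table")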
The following properties could also come in handy. Take a look at these.
spark.sql("SET hive.merge.sparkfiles = true")
spark.sql("SET hive.merge.mapredfiles = true")
spark.sql("SET hive.merge.mapfiles = true")
spark.sql("set hive.merge.smallfiles.avgsize = 128000000")
spark.sql("set hive.merge.size.per.task = 128000000")
Hope this helps.
Upvotes: 0