Reputation: 408
I'm using Hive on Spark. I have gzipped log files in Hadoop. Their average size is 40 MB, whereas the block size is 128 MB. I believed that if I concatenated the log files somehow, I would have fewer blocks and data reading time would be reduced. E.g. I had one log file per hour (24 files per day -> 24 blocks). After aggregation I have 1 file (covering 24 hours) spanning 6 blocks.
I've run benchmark tests using Hive and noticed that, after concatenation, reading time and query execution time increased more than 6 times.
The question: what is wrong with my beliefs about Hadoop and Hive on Spark?
Upvotes: 1
Views: 96
Reputation: 44911
Gzipped text files are not splittable.
Your original data was read by multiple mappers.
Your merged data is read by a single mapper.
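A quick way to see this is to check how many input partitions Spark creates for each layout. This is only a minimal sketch: the paths are hypothetical and the expected counts assume the 24-files-vs-1-file setup from the question. Because gzip is not a splittable codec, textFile produces one partition per .gz file, no matter how many HDFS blocks the file occupies.

    import org.apache.spark.sql.SparkSession

    object GzipSplitCheck {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("gzip-split-check").getOrCreate()
        val sc = spark.sparkContext

        // 24 hourly gzip files -> 24 input partitions -> 24 parallel read tasks
        val hourly = sc.textFile("hdfs:///logs/2017-01-01/*.gz")      // path is hypothetical
        println(s"hourly partitions: ${hourly.getNumPartitions}")     // expect 24

        // 1 merged gzip file -> 1 input partition: a single task streams all 6 blocks
        val merged = sc.textFile("hdfs:///logs/2017-01-01-merged.gz") // path is hypothetical
        println(s"merged partitions: ${merged.getNumPartitions}")     // expect 1

        spark.stop()
      }
    }

So the merged file loses all read parallelism, which is consistent with the slowdown you measured.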
Upvotes: 2