Reputation: 408
I'm using Hive on Spark. I have gzipped log files in Hadoop. Their average size is 40 MB, whereas the block size is 128 MB. I believed that if I concatenated the log files somehow, I would have fewer blocks and data reading time would be reduced. E.g. I had one log file per hour (24 files per day -> 24 blocks). After aggregation I have 1 file (covering 24 hours) spanning 6 blocks.
I've run benchmark tests using Hive and noticed that, after concatenation, reading time and query execution time increased more than 6 times.
The question: what is wrong with my beliefs about Hadoop and Hive on Spark?
Upvotes: 1
Views: 96
Reputation: 44911
Gzipped text files are not splittable.
Your original data was read by multiple mappers.
Your merged data is read by a single mapper.
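A quick way to see this is to check how many input partitions Spark creates for each layout. This is only a minimal sketch: the paths are hypothetical and the expected counts assume the 24-files-vs-1-file setup from the question. Because gzip is not a splittable codec, textFile produces one partition per .gz file, no matter how many HDFS blocks the file occupies.

    import org.apache.spark.sql.SparkSession

    object GzipSplitCheck {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("gzip-split-check").getOrCreate()
        val sc = spark.sparkContext

        // 24 hourly gzip files -> 24 input partitions -> 24 parallel read tasks
        val hourly = sc.textFile("hdfs:///logs/2017-01-01/*.gz")      // path is hypothetical
        println(s"hourly partitions: ${hourly.getNumPartitions}")     // expect 24

        // 1 merged gzip file -> 1 input partition: a single task streams all 6 blocks
        val merged = sc.textFile("hdfs:///logs/2017-01-01-merged.gz") // path is hypothetical
        println(s"merged partitions: ${merged.getNumPartitions}")     // expect 1

        spark.stop()
      }
    }

So the merged file loses all read parallelism, which is consistent with the slowdown you measured.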
Upvotes: 2