Reputation: 31
I have a Hive table on AWS S3 containing 144 CSV-formatted files (about 20 MB per file, 3 GB in total).
When I run SQL against it through Spark SQL, the query downloads 10-15 GB from S3 (the figure varies between runs, as counted by AWS), far more than the table size. But when I run the same SQL through the Hive client, the downloaded bytes equal the table size on S3.
The SQL is as simple as 'select count(1) from #table#'.
On the Spark UI Stages tab there are 2k+ tasks, far more than the 144 files to read. So is one file being accessed by multiple tasks?
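For reference, this is a minimal way I check the scan parallelism from spark-shell ('my_table' stands in for my real table name):

    // run in spark-shell with Hive support enabled;
    // "my_table" is a placeholder for the real table name
    val df = spark.table("my_table")
    // one task per partition, so a value far above 144 means
    // individual files are being split across multiple tasks
    println(df.rdd.getNumPartitions)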
Any help will be appreciated!
Upvotes: 1
Views: 188
Reputation: 31
This is because Spark splits a single file into multiple partitions (each partition corresponds to one task), even when the file size is less than the block size (64 MB or 128 MB).
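For intuition, here is a rough sketch of the split-size math in Hadoop's mapred FileInputFormat, which Spark applies when scanning the table; the file numbers come from the question, while the 64 MB block size and the hint values are assumptions for illustration:

    // goalSize  = totalSize / requestedSplits   (the hint below)
    // splitSize = max(minSize, min(goalSize, blockSize))
    object SplitMath {
      def main(args: Array[String]): Unit = {
        val fileSize  = 20L * 1024 * 1024     // 20 MB per file (from the question)
        val numFiles  = 144L
        val totalSize = fileSize * numFiles   // ~3 GB
        val blockSize = 64L * 1024 * 1024     // assumed 64 MB block size
        val minSize   = 1L

        for (hint <- Seq(2L, 80L, 2000L)) {   // illustrative requested split counts
          val goalSize  = totalSize / hint
          val splitSize = math.max(minSize, math.min(goalSize, blockSize))
          // approximate: each file is cut into ceil(fileSize / splitSize) pieces
          val tasks = numFiles * ((fileSize + splitSize - 1) / splitSize)
          println(f"hint=$hint%-5d splitSize=${splitSize / (1024 * 1024.0)}%.1f MB tasks=$tasks")
        }
      }
    }

With a hint around 2000, each 20 MB file is cut into roughly 14 pieces, which matches the 2k+ tasks seen in the question; with a hint of 80, each file stays in one piece and the scan drops back to 144 tasks.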
So to decrease the number of map tasks, you can decrease the conf 'mapreduce.job.maps', which Spark uses as the requested split count for the scan (the Hadoop default is 2, but my cluster's mapred-site.xml had a much larger value; I changed it to 80 there). This works for the CSV format but not for ORC.
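If editing mapred-site.xml is not convenient, the same property can be forwarded from Spark itself via the 'spark.hadoop.' prefix, which copies the setting into the Hadoop Configuration used for the scan. A minimal sketch ('my_table' is a placeholder):

    import org.apache.spark.sql.SparkSession

    // sketch: set the split-count hint without touching mapred-site.xml
    val spark = SparkSession.builder()
      .enableHiveSupport()
      .config("spark.hadoop.mapreduce.job.maps", "80")
      .getOrCreate()

    // "my_table" is a placeholder for the real table name
    spark.sql("select count(1) from my_table").show()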
Upvotes: 0