paul

Reputation: 31

Downloaded bytes from S3 with Spark SQL are several times more than with Hive SQL

I have a Hive table on AWS S3 containing 144 CSV-formatted files (20 MB each), about 3 GB in total.

When I run a query with Spark SQL, it downloads 10-15 GB (not the same amount every time, as counted by the AWS service), which is much more than the table size. But when I run the same query from the Hive client, the downloaded bytes equal the table size on S3.

The SQL is as simple as `select count(1) from #table#`.

On the Stages tab of the Spark UI there are 2k+ tasks, far more than the number of files being read. So is one file being accessed by multiple tasks?

Any help will be appreciated!

Upvotes: 1

Views: 188

Answers (1)

paul

Reputation: 31

This is because Spark splits one file into multiple partitions (each partition corresponds to one task), even when the file size is less than the block size (64 MB or 128 MB).
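To see how a 20 MB file can end up in many splits, here is a minimal sketch of the split-size formula used by Hadoop's `FileInputFormat` (simplified; the real implementation also applies a ~1.1x "slop" factor). The split-count hint of 2000 is an assumed value chosen to illustrate how a large hint reproduces a task count in the ~2k range seen in the question:

```python
# Simplified split sizing, modeled on Hadoop's FileInputFormat:
#   goalSize  = totalSize / numSplitsHint
#   splitSize = max(minSize, min(goalSize, blockSize))
def split_size(total_size, num_splits_hint, block_size, min_size=1):
    goal = total_size // max(1, num_splits_hint)
    return max(min_size, min(goal, block_size))

def splits_per_file(file_size, split_sz):
    # ceiling division: a file smaller than split_sz is still 1 split
    return -(-file_size // split_sz)

MB = 1024 * 1024
total = 144 * 20 * MB  # 144 CSV files of ~20 MB, ~3 GB total

# With a large split hint, the goal size (~1.4 MB) is far below the
# block size, so each 20 MB file is cut into many splits.
sz = split_size(total, 2000, 64 * MB)
print(splits_per_file(20 * MB, sz) * 144)  # thousands of tasks

# With a small hint, goal size exceeds the file size, so each file
# becomes exactly one split.
sz = split_size(total, 2, 64 * MB)
print(splits_per_file(20 * MB, sz) * 144)  # 144 tasks
```

Each extra split of the same S3 object triggers its own ranged read, which is one way total downloaded bytes can end up several times the table size.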


So in order to reduce the number of map tasks, you can tune the `mapreduce.job.maps` setting (default 2; this works for CSV but not for the ORC format). I changed it to 80 in my mapred-site.xml.
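If you would rather not edit mapred-site.xml, Spark forwards any `spark.hadoop.*` property to the underlying Hadoop configuration, so the same setting can be passed per job. A sketch (`your_job.py` is a placeholder for your application):

```shell
# Pass the Hadoop split hint through Spark's conf instead of
# editing mapred-site.xml; "spark.hadoop." is stripped and the
# rest is handed to the Hadoop Configuration used for reads.
spark-submit \
  --conf spark.hadoop.mapreduce.job.maps=80 \
  your_job.py
```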

Upvotes: 0

Related Questions