Reputation: 31
I have a Hive table on AWS S3 containing 144 CSV-formatted files (about 20 MB per file, 3 GB in total).
When I run SQL against it through Spark SQL, the query downloads 10-15 GB from S3 (the figure varies between runs, as counted by AWS), far more than the table size. But when I run the same SQL through the Hive client, the downloaded bytes equal the table size on S3.
The SQL is as simple as 'select count(1) from #table#'.
On the Spark UI Stages tab there are 2k+ tasks, far more than the 144 files to read. So is one file being accessed by multiple tasks?
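For reference, this is a minimal way I check the scan parallelism from spark-shell ('my_table' stands in for my real table name):

    // run in spark-shell with Hive support enabled;
    // "my_table" is a placeholder for the real table name
    val df = spark.table("my_table")
    // one task per partition, so a value far above 144 means
    // individual files are being split across multiple tasks
    println(df.rdd.getNumPartitions)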
Any help will be appreciated!
Upvotes: 1
Views: 188
Reputation: 31
This is because Spark splits a single file into multiple partitions (each partition corresponds to one task), even when the file size is less than the block size (64 MB or 128 MB).
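For intuition, here is a rough sketch of the split-size math in Hadoop's mapred FileInputFormat, which Spark applies when scanning the table; the file numbers come from the question, while the 64 MB block size and the hint values are assumptions for illustration:

    // goalSize  = totalSize / requestedSplits   (the hint below)
    // splitSize = max(minSize, min(goalSize, blockSize))
    object SplitMath {
      def main(args: Array[String]): Unit = {
        val fileSize  = 20L * 1024 * 1024     // 20 MB per file (from the question)
        val numFiles  = 144L
        val totalSize = fileSize * numFiles   // ~3 GB
        val blockSize = 64L * 1024 * 1024     // assumed 64 MB block size
        val minSize   = 1L

        for (hint <- Seq(2L, 80L, 2000L)) {   // illustrative requested split counts
          val goalSize  = totalSize / hint
          val splitSize = math.max(minSize, math.min(goalSize, blockSize))
          // approximate: each file is cut into ceil(fileSize / splitSize) pieces
          val tasks = numFiles * ((fileSize + splitSize - 1) / splitSize)
          println(f"hint=$hint%-5d splitSize=${splitSize / (1024 * 1024.0)}%.1f MB tasks=$tasks")
        }
      }
    }

With a hint around 2000, each 20 MB file is cut into roughly 14 pieces, which matches the 2k+ tasks seen in the question; with a hint of 80, each file stays in one piece and the scan drops back to 144 tasks.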
So to decrease the number of map tasks, you can decrease the conf 'mapreduce.job.maps', which Spark uses as the requested split count for the scan (the Hadoop default is 2, but my cluster's mapred-site.xml had a much larger value; I changed it to 80 there). This works for the CSV format but not for ORC.
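If editing mapred-site.xml is not convenient, the same property can be forwarded from Spark itself via the 'spark.hadoop.' prefix, which copies the setting into the Hadoop Configuration used for the scan. A minimal sketch ('my_table' is a placeholder):

    import org.apache.spark.sql.SparkSession

    // sketch: set the split-count hint without touching mapred-site.xml
    val spark = SparkSession.builder()
      .enableHiveSupport()
      .config("spark.hadoop.mapreduce.job.maps", "80")
      .getOrCreate()

    // "my_table" is a placeholder for the real table name
    spark.sql("select count(1) from my_table").show()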
Upvotes: 0