Reputation: 1249
Input Data:
Experiment 1:
Experiment 1 Result:
Experiment 2:
Experiment 2 Result:
Q1: Any idea how Spark determines the number of tasks to read Hive table data files? I repeated the same experiments with the same data in HDFS and got similar results.
My understanding is that the number of tasks to read Hive table files should be the same as the number of blocks in HDFS. Q2: Is that correct? Q3: Is that also correct when the data is in a GCS bucket (instead of HDFS)?
Thanks in advance!
Upvotes: 7
Views: 705
Reputation: 26458
The number of tasks in one stage is equal to the number of partitions of the input data, which in turn is determined by the data size and the related configs (dfs.blocksize for HDFS, fs.gs.block.size for GCS, mapreduce.input.fileinputformat.split.minsize, and mapreduce.input.fileinputformat.split.maxsize). For a complex query that involves multiple stages, the total number of tasks is the sum of the task counts of all stages.
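As a rough sketch (not from the original answer), the split-size configs can be set on the SparkContext's Hadoop configuration before the table is read, and the resulting partition count, which equals the number of tasks in the scan stage, can be inspected. The table name here is hypothetical, and note that if Spark converts the table to its native file source (e.g. for Parquet), spark.sql.files.maxPartitionBytes may control the split size instead.

```scala
import org.apache.spark.sql.SparkSession

object TaskCountSketch {
  def main(args: Array[String]): Unit = {
    // Sketch only: assumes Hive support and a table named my_db.my_table (hypothetical).
    val spark = SparkSession.builder()
      .appName("task-count-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Cap each input split at 64 MB; smaller splits generally mean more read tasks.
    spark.sparkContext.hadoopConfiguration
      .setLong("mapreduce.input.fileinputformat.split.maxsize", 64L * 1024 * 1024)

    val df = spark.table("my_db.my_table")

    // Partition count of the scanned data = number of tasks in the scan stage.
    println(s"read partitions: ${df.rdd.getNumPartitions}")

    spark.stop()
  }
}
```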
There is no difference between HDFS and GCS, except that they use different configs for the block size: dfs.blocksize vs fs.gs.block.size.
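For completeness, a hedged sketch of inspecting the block size a FileSystem reports for a given file (the paths are hypothetical, and the GCS connector must be on the classpath for the gs:// path): HDFS reports the file's dfs.blocksize, while the GCS connector reports fs.gs.block.size, which is only a nominal value since GCS objects are not actually stored in blocks.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

object BlockSizeSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()

    // Hypothetical file locations; replace with real table data file paths.
    val paths = Seq(
      new Path("hdfs:///warehouse/my_db.db/my_table/part-00000"),
      new Path("gs://my-bucket/warehouse/my_table/part-00000")
    )

    // HDFS returns the file's dfs.blocksize; the GCS connector returns
    // fs.gs.block.size, a reported value used only for split calculation.
    paths.foreach { p =>
      val fs = p.getFileSystem(conf)
      println(s"$p -> block size ${fs.getFileStatus(p).getBlockSize} bytes")
    }
  }
}
```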
Upvotes: 0