dykw

Reputation: 1249

How does Spark (2.3 or a newer version) determine the number of tasks to read Hive table files in a GCS bucket or in HDFS?

Input Data:

Experiment 1:

Experiment 1 Result:

Experiment 2:

Experiment 2 Result:

Q1: Any idea how Spark determines the number of tasks to read Hive table data files? I repeated the same experiments with the same data in HDFS and got similar results.

My understanding is that the number of tasks to read Hive table files should be the same as the number of blocks in HDFS. Q2: Is that correct? Q3: Is it also correct when the data is in a GCS bucket (instead of HDFS)?

Thanks in advance!

Upvotes: 7

Views: 705

Answers (1)

Dagang Wei

Reputation: 26458

The number of tasks in one stage is equal to the number of partitions of the input data, which in turn is determined by the data size and the related configs: dfs.blocksize (HDFS), fs.gs.block.size (GCS), mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize. For a complex query that involves multiple stages, the total number of tasks is the sum of the task counts of all stages.
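For illustration, here is a minimal PySpark sketch of that mechanism; the table name db.events and the 64 MB value are made up. It passes the split-size configs mentioned above through the Spark session (via the spark.hadoop.* prefix) and then checks the number of partitions of the scan, which is the number of tasks in the read stage:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("task-count-demo")
    .enableHiveSupport()
    # Cap each input split at 64 MB: a smaller max split size -> more splits -> more read tasks.
    .config("spark.hadoop.mapreduce.input.fileinputformat.split.maxsize", str(64 * 1024 * 1024))
    .config("spark.hadoop.mapreduce.input.fileinputformat.split.minsize", str(1))
    .getOrCreate()
)

df = spark.sql("SELECT * FROM db.events")  # hypothetical Hive table
# Number of partitions of the scan == number of tasks in the read stage.
print(df.rdd.getNumPartitions())
```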

There is no difference between HDFS and GCS, except that they use different configs for the block size: dfs.blocksize vs. fs.gs.block.size.
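As a small sketch of that config difference only, reusing the spark session from the example above, the two block-size settings can be read back from the Hadoop configuration Spark carries (note that _jsc is an internal handle, and get() returns None when the value comes only from the filesystem's own defaults rather than this client-side configuration):

```python
# Read back the two block-size configs the answer refers to; values are byte counts as strings.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
print(hadoop_conf.get("dfs.blocksize"))      # HDFS block size
print(hadoop_conf.get("fs.gs.block.size"))   # block size reported by the GCS connector (GCS has no real blocks)
```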

See the following related questions:

Upvotes: 0
