John

Reputation: 11831

How does parallelism work when using Flink SQL?

I understand that in the Flink DataStream world parallelism means each slot will get a subset of events [1].

A Flink program consists of multiple tasks (transformations/operators, data sources, and sinks). A task is split into several parallel instances for execution and each parallel instance processes a subset of the task’s input data. The number of parallel instances of a task is called its parallelism.
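For context, here is a minimal DataStream sketch of the behaviour that quoted paragraph describes; the elements, the map function, and the parallelism of 4 are purely illustrative:

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class ParallelMapSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setParallelism(4); // default parallelism for all operators in this job

            env.fromElements(1, 2, 3, 4, 5, 6, 7, 8)
               // this map operator runs as 4 parallel instances;
               // each instance processes only a subset of the elements
               .map(i -> i * 10)
               .print();

            env.execute("parallel map sketch");
        }
    }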

However, how does that work in the Flink SQL world, where you need to do joins between tables? If the events in tables A and B are being processed in parallel, doesn't that mean any given slot will only hold some of the events from tables A and B? How does Flink ensure consistency of results irrespective of the parallelism used, or does it just copy all events to every slot, in which case I don't understand how parallelism helps?

Upvotes: 3

Views: 878

Answers (1)

Sauron

Reputation: 1353

  • When a join is executed, Flink redistributes the data across the parallel instances based on the join key. This means that events with the same join key from tables A and B will be sent to the same parallel instance for processing. Flink achieves this by using a hash-based partitioning strategy (see the sketch after this list).
  • By partitioning the data based on the join key, Flink ensures that all events with the same key are processed together, regardless of the parallelism level.
  • The parallelism level determines the number of parallel instances or slots available for processing. Each one receives a subset of the data according to the partitioning strategy (here, the hash of the join key).
  • Flink does not copy all events to all slots, as that would be inefficient and defeat the purpose of parallelism. Instead, Flink leverages parallelism to distribute the workload across multiple slots.
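A minimal Table API / SQL sketch of the point above, assuming two example tables A and B joined on a column `id`; the schemas, the datagen connector options, and the parallelism of 4 are illustrative, not something prescribed by the question:

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

    public class SqlJoinParallelismSketch {
        public static void main(String[] args) {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setParallelism(4); // four parallel subtasks per operator

            StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

            // Two illustrative tables that share the join key `id`
            tEnv.executeSql(
                "CREATE TEMPORARY TABLE A (id INT, payload STRING) WITH (" +
                " 'connector' = 'datagen', 'fields.id.min' = '1', 'fields.id.max' = '10')");
            tEnv.executeSql(
                "CREATE TEMPORARY TABLE B (id INT, info STRING) WITH (" +
                " 'connector' = 'datagen', 'fields.id.min' = '1', 'fields.id.max' = '10')");

            // The planner turns this join into a hash exchange on `id`: rows from A and B
            // with the same id are routed to the same parallel subtask, so the join result
            // is the same no matter which parallelism is configured.
            tEnv.executeSql(
                "SELECT A.id, A.payload, B.info FROM A JOIN B ON A.id = B.id").print();
        }
    }

Running this with different values for env.setParallelism produces the same join results; only the number of subtasks sharing the work changes.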

Upvotes: 2
