Anthony Ma
Anthony Ma

Reputation: 49

how to read data from mysql to flink parallelly?

how to read data from mysql to flink parallelly?I want to build a sourceFunction to continuously read data every a period of time from mysql in parallel.How can I achieve it?

Upvotes: 2

Views: 1486

Answers (1)

Fabian Hueske
Fabian Hueske

Reputation: 18987

The answer to this question includes two aspects:

  1. reading from MySQL (or any other JDBC source) in parallel
  2. reading from MySQL (or any other JDBC source) in periodic intervals

Reading from MySQL in parallel

In order to read from MySQL in parallel, you need to send multiple different queries. The queries must be composed in a way that the union of their results is equivalent to the expected result. For example, you can use range predicates to split queries among a numeric attribute:

Q1: SELECT * FROM sourceT WHERE num < 10;
Q2: SELECT * FROM sourceT WHERE num >= 10 AND num < 20;
Q3: SELECT * FROM sourceT WHERE num >= 20;

There are also other ways to partition a query. But in order to actually gain something, the DBMS must be able to handle multiple queries more efficiently than a single query that queries the whole table. So usually, you want to ensure that the attribute across which you partition (num in the example above) is indexed. Still, shooting multiple queries at a single database instance causes overhead. So, it's not trivial to find the parallelism that gives the best performance.

Reading from MySQL in periodic intervals

This works similar to the parallel read. Again, you need to partition your query. But now you'd like want to do this based on an attribute that describes the time of the records. So in each interval, you want to ask for the rows that were inserted since the last interval. Again, this would be done with a range predicate on the time attribute.

Q at T1: SELECT * FROM sourceT WHERE rowtime < T1;
Q at T2: SELECT * FROM sourceT WHERE rowtime < T2;

As before, this only works efficiently if the table is indexed on the rowtime attribute. Otherwise, you'll do a full table scan and the query will become slower and slower as more data is inserted.

Reading from MySQL in parallel in periodic intervals

For this you "just" have to combine both approaches and add two predicates to each query. What you essentially do is to partition the table into disjunctive parts and read them in parallel over time.

However as I pointed out before, the exact partitioning depends on your data and use case. Moreover, you need to create appropriate indexes to avoid full table scans. Please note as well, that with the approach above, you won't see any updates of rows that are modified after they were read.

Upvotes: 1

Related Questions