Reputation: 49
How can I read data from MySQL into Flink in parallel? I want to build a SourceFunction that continuously reads data from MySQL in parallel, once every period of time. How can I achieve this?
Upvotes: 2
Views: 1486
Reputation: 18987
The answer to this question has two aspects: how to read from MySQL in parallel, and how to read periodically.
In order to read from MySQL in parallel, you need to send multiple different queries. The queries must be composed such that the union of their results is equivalent to the expected result. For example, you can use range predicates on a numeric attribute to split the query:
Q1: SELECT * FROM sourceT WHERE num < 10;
Q2: SELECT * FROM sourceT WHERE num >= 10 AND num < 20;
Q3: SELECT * FROM sourceT WHERE num >= 20;
There are also other ways to partition a query. But in order to actually gain anything, the DBMS must be able to handle multiple queries more efficiently than a single query that scans the whole table. So you usually want to ensure that the attribute on which you partition (num in the example above) is indexed. Still, sending multiple queries to a single database instance causes overhead, so it is not trivial to find the degree of parallelism that gives the best performance.
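The range predicates above can be generated programmatically. A minimal sketch in Python (the function name and the equal-width splitting strategy are my own; in practice you might split on quantiles instead so partitions hold similar row counts):

```python
def partition_queries(table, column, min_val, max_val, parallelism):
    """Split [min_val, max_val) into equal ranges, one query per parallel reader.

    The first query has no lower bound and the last no upper bound, so
    the union of the results covers the whole table exactly once.
    """
    step = (max_val - min_val) / parallelism
    boundaries = [min_val + round(i * step) for i in range(1, parallelism)]
    queries = []
    lower = None
    for upper in boundaries + [None]:
        preds = []
        if lower is not None:
            preds.append(f"{column} >= {lower}")
        if upper is not None:
            preds.append(f"{column} < {upper}")
        queries.append(f"SELECT * FROM {table} WHERE " + " AND ".join(preds))
        lower = upper
    return queries

# With min 0, max 30, and parallelism 3 this reproduces Q1-Q3 above.
for q in partition_queries("sourceT", "num", 0, 30, 3):
    print(q)
```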
Periodic reading works similarly to the parallel read. Again, you need to partition your query, but now based on an attribute that describes the time of the records: in each interval, you ask for the rows that were inserted since the last interval. Again, this is done with a range predicate on the time attribute.
Q at T1: SELECT * FROM sourceT WHERE rowtime < T1;
Q at T2: SELECT * FROM sourceT WHERE rowtime >= T1 AND rowtime < T2;
As before, this only works efficiently if the table is indexed on the rowtime attribute. Otherwise, you'll do a full table scan and the query will become slower and slower as more data is inserted.
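The polling logic boils down to remembering the upper bound of the previous window and using it as the lower bound of the next one. A small sketch (function and parameter names are illustrative, not from any Flink API):

```python
def incremental_query(table, time_col, last_time, now):
    """Build a query that asks only for rows inserted since the previous poll.

    On the first poll (last_time is None) there is no lower bound; afterwards
    the windows [last_time, now) tile the timeline without gaps or overlaps.
    """
    if last_time is None:
        return f"SELECT * FROM {table} WHERE {time_col} < {now}"
    return (f"SELECT * FROM {table} "
            f"WHERE {time_col} >= {last_time} AND {time_col} < {now}")

print(incremental_query("sourceT", "rowtime", None, 100))
print(incremental_query("sourceT", "rowtime", 100, 200))
```

Note that the caller must record `now` as the new `last_time` after each poll, so no row is read twice and none is skipped.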
To read in parallel and periodically, you "just" have to combine both approaches and add two predicates to each query. Essentially, you partition the table into disjoint parts and read them in parallel over time.
However, as I pointed out before, the exact partitioning depends on your data and use case. Moreover, you need to create appropriate indexes to avoid full table scans. Please note as well that with the approach above, you won't see any updates to rows that are modified after they were read.
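Combining both predicates can be sketched end to end. The example below uses Python's built-in sqlite3 as a stand-in for MySQL (the table, column names, and `poll_partition` helper are all hypothetical); each parallel subtask would run this with its own num range inside its SourceFunction:

```python
import sqlite3

def poll_partition(conn, num_lo, num_hi, last_time, now):
    """One parallel subtask's poll: its own num range, restricted to the
    window of rows inserted since the previous poll."""
    return conn.execute(
        "SELECT num, rowtime FROM sourceT "
        "WHERE num >= ? AND num < ? AND rowtime >= ? AND rowtime < ?",
        (num_lo, num_hi, last_time, now),
    ).fetchall()

# Tiny in-memory table standing in for the MySQL source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sourceT (num INTEGER, rowtime INTEGER)")
conn.executemany("INSERT INTO sourceT VALUES (?, ?)",
                 [(5, 1), (15, 1), (25, 2), (7, 3)])

# Subtask owning the range [0, 10), polling the time window [0, 3):
# only (5, 1) qualifies; (7, 3) is left for the next poll.
rows = poll_partition(conn, 0, 10, 0, 3)
print(rows)
```

In a real Flink job, each subtask of a RichParallelSourceFunction would pick its num range from its subtask index and sleep between polls; the window bounds would come from the wall clock or a checkpointed offset.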
Upvotes: 1