Reputation: 110211
I have two sources that I need to join together. Let's say each of these sources holds about 100M rows of data, and I want to join the results of two queries that run against these sources. So, conceptually speaking, the join query looks like this:
SELECT *
FROM
(query1 against ElasticSearch source -- results are of unknown size)
JOIN
(query2 against BigQuery source -- results are of unknown size)
ON query1.joinkey = query2.joinkey
In other words, the results of query1 could be anywhere from 0 bytes/rows to 10GB/100M rows. Same with the results of query2.
How does Apache Beam or Cloud Dataflow deal with joins of unknown size, for example when the two queries are defined at run time? And for the above case, is Apache Beam a good fit, or might there be better options?
I suppose that in the case where both result sets may be arbitrarily large, it might work best to do the join (conceptually at least) as:
WITH query1 AS (
es_query -> materialize to BigQuery
)
SELECT *
FROM
query1 JOIN query2 USING (joinkey)
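For reference, a minimal sketch of that "materialize, then join" idea using the Beam Python SDK: stage the Elasticsearch results into a BigQuery table, then let BigQuery execute the join. The Create source stands in for the real Elasticsearch read, and the project, dataset, table, and field names are hypothetical:

    import apache_beam as beam

    with beam.Pipeline() as p:
        (p
         | 'ReadES' >> beam.Create([{'joinkey': 'k1', 'es_field': 'a'}])  # stand-in for the ES query
         | 'Stage' >> beam.io.WriteToBigQuery(
             'my-project:staging.query1_results',            # hypothetical staging table
             schema='joinkey:STRING,es_field:STRING',
             create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
             write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))

    # After the pipeline finishes, the join runs entirely inside BigQuery:
    #   SELECT * FROM staging.query1_results JOIN dataset.table2 USING (joinkey)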
Upvotes: 1
Views: 1382
Reputation: 2024
There are several ways you can approach a join in an Apache Beam pipeline. For example:
- CoGroupByKey, which groups two (or more) keyed PCollections together by key
- Side inputs, where one dataset is broadcast to the workers processing the other
Neither approach has a set size limit, so both should work for arbitrarily large datasets. Side inputs, however, are better suited to cases where a relatively small dataset is fed (and iterated over) while a large dataset is processed as the main input. So if both datasets are large and roughly the same size, CoGroupByKey might better suit your case.
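Here is a minimal sketch of the CoGroupByKey approach in the Python SDK. The Create sources and the field names stand in for the real Elasticsearch and BigQuery reads (reading Elasticsearch typically requires a custom source or an export step; beam.io.ReadFromBigQuery is the built-in BigQuery source):

    import apache_beam as beam

    def expand_pairs(element):
        """Emit one joined row per (left, right) combination for a key."""
        key, grouped = element
        for left in grouped['es']:
            for right in grouped['bq']:
                yield {'joinkey': key, **left, **right}

    with beam.Pipeline() as p:
        # Stand-in for the Elasticsearch query results, keyed by joinkey.
        es_rows = (
            p
            | 'ReadES' >> beam.Create([{'joinkey': 'k1', 'es_field': 'a'}])
            | 'KeyES' >> beam.Map(lambda row: (row['joinkey'], row)))

        # Stand-in for the BigQuery query results, keyed by joinkey.
        bq_rows = (
            p
            | 'ReadBQ' >> beam.Create([{'joinkey': 'k1', 'bq_field': 'b'}])
            | 'KeyBQ' >> beam.Map(lambda row: (row['joinkey'], row)))

        joined = (
            {'es': es_rows, 'bq': bq_rows}
            | 'CoGroup' >> beam.CoGroupByKey()
            | 'Join' >> beam.FlatMap(expand_pairs)
            | 'Print' >> beam.Map(print))

For the side-input variant, the smaller result set would instead be passed to a ParDo over the larger one, e.g. via beam.pvalue.AsDict or beam.pvalue.AsIter.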
For additional code examples for Java and Python see here and here.
Upvotes: 1