Reputation: 51
I have two bigquery tables.
Table A
c_id count_c_id p_id
Table B
id c_name p_type c_id
Based on columns in Table A, I need to find details from Table B using DF pipelines.
PCollection<TableRow> tableRowBQ = pipeline.apply(BigQueryIO.Read
.named("Read").fromQuery("select c_id,count_c_id,p_id from TableA"));
My requirement is based on c_id return by this query I should be able to get c_name from TableB using pcollection. I am not able to find any example of pcollection for iterating field from one table and fetch data from another table using that field.
Example shared by Google Team as a reference. https://github.com/GoogleCloudPlatform/DataflowSDK-examples/blob/master-1.x/src/main/java/com/google/cloud/dataflow/examples/cookbook/JoinExamples.java.
Upvotes: 0
Views: 970
Reputation: 16124
From the Cloud DataFlow documentation, I think you have two approaches to complete this kind of JOIN task. Depending on how many unique c_id
s you have, I would choose one of the two options below.
c_id
s, I would treat the result from your SELECT
query as a side input, and use that as filter to pass into a lookup into Table B;Here is a code snippet to consume BigQueryIO query result as side input: Apply Side input to BigQueryIO.read operation in Apache Beam.
CoGroupByKey
. For example,
Table A:
c_id -> count_c_id, p_id
Table B
c_id -> c_name
Since both tables can be seen as a shared key to some different values, you can CoGroup them into
c_id -> ([count_c_id, p_id], [c_name]]
then you can do whatever you want on this "grouped" new PCollection.
Upvotes: 2