user1115163
user1115163

Reputation: 51

Pcollection for multiple Tables

I have two bigquery tables.

Table A

c_id    count_c_id  p_id

Table B

id   c_name   p_type  c_id 

Based on columns in Table A, I need to find details from Table B using DF pipelines.

PCollection<TableRow> tableRowBQ = pipeline.apply(BigQueryIO.Read
.named("Read").fromQuery("select c_id,count_c_id,p_id from TableA"));

My requirement is based on c_id return by this query I should be able to get c_name from TableB using pcollection. I am not able to find any example of pcollection for iterating field from one table and fetch data from another table using that field.

Example shared by Google Team as a reference. https://github.com/GoogleCloudPlatform/DataflowSDK-examples/blob/master-1.x/src/main/java/com/google/cloud/dataflow/examples/cookbook/JoinExamples.java.

Upvotes: 0

Views: 970

Answers (1)

greeness
greeness

Reputation: 16124

From the Cloud DataFlow documentation, I think you have two approaches to complete this kind of JOIN task. Depending on how many unique c_ids you have, I would choose one of the two options below.

  • if you don't have many c_ids, I would treat the result from your SELECT query as a side input, and use that as filter to pass into a lookup into Table B;

Here is a code snippet to consume BigQueryIO query result as side input: Apply Side input to BigQueryIO.read operation in Apache Beam.

  • Otherwise, you can use CoGroupByKey. For example,
Table A:
c_id -> count_c_id, p_id

Table B
c_id -> c_name

Since both tables can be seen as a shared key to some different values, you can CoGroup them into

c_id -> ([count_c_id, p_id], [c_name]]

then you can do whatever you want on this "grouped" new PCollection.

Upvotes: 2

Related Questions