Reputation: 1300
What I would like to do in the pipeline:
Run a parameterized (dynamic) query against BigQuery, where the WHERE clause takes its value from the Pub/Sub message, like this:
SELECT field1 FROM Table where field2 = @valueFromP/S
The pipeline
| 'Read from PubSub' >> beam.io.ReadFromPubSub(subscription='')
| 'String to dictionary' >> beam.Map(lambda s:data_ingestion.parse_method(s))
| 'BigQuery' >> <Here is where I'm not sure how to do it>
The normal way to read from BQ would be:
| 'Read' >> beam.io.Read(beam.io.BigQuerySource(
    query="SELECT field1 FROM table where field2='string'", use_standard_sql=True))
I have read about parameterized queries, but I'm not sure whether they work with Apache Beam.
Could it be done using side inputs?
What would be the best way to do this?
What I've tried:
def parse_methodBQ(input):
    query = 'SELECT field1 FROM table WHERE field1=\'%s\' AND field2=True' % (input['field1'])
    return query

class ReadFromBigQuery(beam.PTransform):
    def expand(self, pcoll):
        return (
            pcoll
            | 'FormatQuery' >> beam.Map(parse_methodBQ)
            | 'Read' >> beam.Map(lambda s: beam.io.Read(beam.io.BigQuerySource(query=s)))
        )

with beam.Pipeline(options=pipeline_options) as p:
    transform = (p | 'BQ' >> ReadFromBigQuery())
The result (why this?):
<Read(PTransform) label=[Read]>
The correct result should be like:
{u'Field1': u'string', u'Field2': Bool}
THE SOLUTION
In the pipeline:
| 'BQ' >> beam.Map(parse_method_BQ))
The function (using version 0.25 of the BigQuery client API, for Dataflow):
def parse_method_BQ(input):
    client = bigquery.Client()
    QUERY = 'SELECT field1 FROM table WHERE field1=\'%s\' AND field2=True' % (input['field1'])
    client.use_legacy_sql = False
    query_job = client.run_async_query(query=QUERY, job_name='temp-query-job_{}'.format(uuid.uuid4()))  # API request
    query_job.begin()
    while True:
        query_job.reload()  # Refreshes the state via a GET request.
        if query_job.state == 'DONE':
            if query_job.error_result:
                raise RuntimeError(query_job.errors)
            rows = query_job.results().fetch_data()
            for row in rows:
                if row[0] is not None:
                    return input
        time.sleep(1)
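The function above builds the SQL by string interpolation. As a hedged alternative, here is a minimal sketch of the same kind of lookup with a real query parameter (the `@value` form from the question). It assumes a recent google-cloud-bigquery client, not the 0.25 API used above. The table name `my_dataset.my_table`, the message key `field2`, and the helper names `query_params_for` and `lookup_rows` are made up for illustration.

```python
# Hypothetical table name; replace with the real dataset and table.
SQL = "SELECT field1 FROM `my_dataset.my_table` WHERE field2 = @value"


def query_params_for(element):
    """Describe the @-parameters in SQL as (name, type, value) tuples.

    Assumes the parsed Pub/Sub message is a dict with a 'field2' key.
    """
    return [("value", "STRING", element["field2"])]


def lookup_rows(client, element):
    """Run the parameterized query and return the rows as dictionaries.

    `client` is expected to be a google.cloud.bigquery.Client.
    """
    # Imported lazily so the module can be loaded without the library.
    from google.cloud import bigquery

    params = [
        bigquery.ScalarQueryParameter(name, type_, value)
        for name, type_, value in query_params_for(element)
    ]
    job_config = bigquery.QueryJobConfig(query_parameters=params)
    # result() blocks until the query job finishes, so no polling loop is needed.
    rows = client.query(SQL, job_config=job_config).result()
    return [dict(row.items()) for row in rows]
```

In a pipeline this would probably be called from a DoFn's process method, with the client created once per worker (for example in setup) rather than once per element.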
Upvotes: 0
Views: 1991
Reputation: 1004
You can read the whole table or use a string query.
I understand that you will use the parse_methodBQ method to customize the query as needed. Since this method returns a query string, you can pass it to BigQuerySource. Each row is returned as a dictionary.
| 'QueryTable' >> beam.Map(beam.io.BigQuerySource(parse_methodBQ))
# Each row is a dictionary where the keys are the BigQuery columns
| 'Read' >> beam.Map(lambda s: s['data'])
Furthermore, you can avoid customizing the query at all and use a filter method instead.
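A minimal sketch of the filter idea, assuming the table is small enough to read in full and the wanted value is known when the pipeline is built. The predicate name `has_field2_value` and the sample rows are made up for illustration.

```python
def has_field2_value(row, wanted):
    """Return True when the row's field2 equals the wanted value."""
    return row.get("field2") == wanted


# Plain-Python check of the predicate on rows shaped like BigQuery results.
sample_rows = [
    {"field1": "a", "field2": "x"},
    {"field1": "b", "field2": "y"},
]
kept = [row for row in sample_rows if has_field2_value(row, "x")]

# In a pipeline the same predicate could be passed to beam.Filter,
# which forwards extra keyword arguments to the function:
#   | 'Filter' >> beam.Filter(has_field2_value, wanted='x')
```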
Regarding side inputs, review this example from the cookbook to get a better idea of how to use them.
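As a rough sketch of how a side input could fit this case: the BigQuery rows are read once and handed to every Pub/Sub message as an extra argument. The function name `matching_rows` and the key `field2` are assumptions for illustration, and this only makes sense if the table fits in worker memory.

```python
def matching_rows(message, table_rows):
    """Return the table rows whose field2 matches the message's field2.

    `message` is the parsed Pub/Sub dict; `table_rows` is the side input,
    a list of row dictionaries read from BigQuery.
    """
    return [row for row in table_rows if row["field2"] == message["field2"]]


# Plain-Python check with made-up data.
table = [
    {"field1": "a", "field2": "x"},
    {"field1": "b", "field2": "y"},
    {"field1": "c", "field2": "x"},
]
found = matching_rows({"field2": "x"}, table)

# In a pipeline, with `messages` from Pub/Sub and `bq_rows` from BigQuery:
#   messages | 'Join' >> beam.FlatMap(
#       matching_rows, table_rows=beam.pvalue.AsList(bq_rows))
```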
Upvotes: 1