Google Dataflow: running dynamic query with BigQuery+Pub/Sub in Python

Question

What I would like to do in the pipeline:

Read from pub/sub (done)
Transform this data to dictionary (done)
Take the value of a specified key from the dict (done)
Run a parameterized/dynamic query against BigQuery in which the WHERE clause looks like this:
```
SELECT field1 FROM Table where field2 = @valueFromP/S
```

The pipeline

| 'Read from PubSub' >> beam.io.ReadFromPubSub(subscription='')
| 'String to dictionary' >> beam.Map(lambda s:data_ingestion.parse_method(s))
| 'BigQuery' >>

The normal way to read from BigQuery would be:

| 'Read' >> beam.io.Read(beam.io.BigQuerySource(
                query="SELECT field1 FROM table where field2='string'", use_standard_sql=True))

I have read about parameterized queries, but I'm not sure whether they work with Apache Beam.

Could it be done using side inputs?

Which would be the best way to do this?

What I've tried:

def parse_methodBQ(input):
    query = "SELECT field1 FROM table WHERE field1='%s' AND field2=True" % (input['field1'])
    return query


class ReadFromBigQuery(beam.PTransform):
    def expand(self, pcoll):
        return (
                pcoll
                | 'FormatQuery' >> beam.Map(parse_methodBQ)
                | 'Read' >> beam.Map(lambda s:  beam.io.Read(beam.io.BigQuerySource(query=s)))
        )

with beam.Pipeline(options=pipeline_options) as p:
    transform = (p | 'BQ' >> ReadFromBigQuery())

The result (why this?):

The correct result should be like:

{u'Field1': u'string', u'Field2': Bool}

THE SOLUTION

In the pipeline:

| 'BQ' >> beam.Map(parse_method_BQ))

The function (using version 0.25 of the BigQuery client library on Dataflow):

import time
import uuid

from google.cloud import bigquery


def parse_method_BQ(input):
    client = bigquery.Client()
    QUERY = 'SELECT field1 FROM table WHERE field1=\'%s\' AND field2=True' % (input['field1'])
    query_job = client.run_async_query(query=QUERY, job_name='temp-query-job_{}'.format(uuid.uuid4()))  # API request
    query_job.use_legacy_sql = False  # Set on the job (not the client), before begin().
    query_job.begin()
    while True:
        query_job.reload()  # Refreshes the state via a GET request.
        if query_job.state == 'DONE':
            if query_job.error_result:
                raise RuntimeError(query_job.errors)
            rows = query_job.results().fetch_data()
            for row in rows:
                if row[0] is not None:
                    return input
            return None  # Job finished with no matching row; stop polling instead of looping forever.
        time.sleep(1)
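
The string-formatted query in this function breaks if the Pub/Sub value contains a single quote. Below is a minimal sketch of the same lookup with a named query parameter. It assumes a recent `google-cloud-bigquery` release, where `Client.query`, `QueryJobConfig` and `ScalarQueryParameter` replace the 0.25 calls. The table name `my_dataset.my_table`, the helper names, and the injectable `run_query` argument are illustrative assumptions, not part of the original solution.

```python
SQL = (
    "SELECT field1 FROM `my_dataset.my_table` "  # hypothetical table name
    "WHERE field1 = @value AND field2 = TRUE"
)


def run_parameterized_query(sql, value):
    """Run `sql` with @value bound as a STRING parameter and return the rows."""
    # Imported lazily so the helper below can be exercised without the library.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("value", "STRING", value)]
    )
    # result() blocks until the job finishes, so no manual polling loop is needed.
    return list(client.query(sql, job_config=job_config).result())


def keep_if_present(element, run_query=run_parameterized_query):
    """Return True when the query yields a non-null field1 for this element."""
    rows = run_query(SQL, element["field1"])
    return any(row[0] is not None for row in rows)
```

In the pipeline this could be wired in as `| 'BQ' >> beam.Filter(keep_if_present)`, which drops elements without a match rather than emitting `None`. Creating a client for every element keeps the sketch short but is slow at volume.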

Nathan Nasser · Accepted Answer

You can read the whole table or use a string query.

I understand that you will use the parse_methodBQ method to customize the query as needed. Since this method returns a query string, you can pass it to BigQuerySource. Each row is returned as a dictionary.

| 'QueryTable' >> beam.Map(beam.io.BigQuerySource(parse_methodBQ))
# Each row is a dictionary where the keys are the BigQuery columns
| 'Read' >> beam.Map(lambda s:  s['data'])

Furthermore, you can avoid customizing the query and use a filter method instead.
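
A minimal sketch of the filter idea, assuming the rows arrive as dictionaries keyed by column name and that the wanted value is known when the pipeline is built (it does not cover a value that arrives per Pub/Sub message). The function names are hypothetical.

```python
def row_matches(row, wanted_value):
    """True for BigQuery rows (dicts) whose field2 equals the wanted value."""
    return row.get("field2") == wanted_value


def filter_rows(rows, wanted_value):
    """Apply the predicate to a PCollection of row dicts with beam.Filter."""
    import apache_beam as beam  # lazy import; only needed when building a pipeline

    # Extra positional arguments to beam.Filter are passed on to the predicate.
    return rows | "KeepMatching" >> beam.Filter(row_matches, wanted_value)
```

This reads the whole table and discards rows in the pipeline, so it trades query cost for simplicity.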

Regarding side inputs, review this example from the cookbook for a better view of how to use them.
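
As a rough sketch of the side-input approach: read the table once, turn it into a dictionary keyed by `field2`, and pass that dictionary to a `beam.Map` over the Pub/Sub elements. It assumes `field2` values are unique in the table and that the table is small enough to fit in memory. The function names and the `bq_field1` output key are hypothetical.

```python
def lookup_field1(element, field1_by_field2, key="field2"):
    """Attach the field1 value whose field2 matches element[key], if any."""
    enriched = dict(element)  # copy so the input element is not mutated
    enriched["bq_field1"] = field1_by_field2.get(element.get(key))
    return enriched


def enrich_with_side_input(messages, table_rows):
    """Wire the lookup into Beam; both arguments are PCollections of dicts."""
    import apache_beam as beam  # lazy import; only needed when building a pipeline

    keyed = table_rows | "KeyByField2" >> beam.Map(lambda r: (r["field2"], r["field1"]))
    # AsDict materializes the keyed rows as a dict available to every element.
    return messages | "Lookup" >> beam.Map(
        lookup_field1, field1_by_field2=beam.pvalue.AsDict(keyed)
    )
```

With a bounded BigQuery read as the side input, the table is loaded once, so rows added after the pipeline starts are not visible to the lookup.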
