gopal

Reputation: 49

How to read from BigQuery in a streaming pipeline written using Apache Beam

I want to run a streaming pipeline that reads continuously from a Google BigQuery table. Right now my streaming pipeline stops after reading from the BigQuery table once. The Apache Beam documentation doesn't seem to mention this.

Please help

My design:

I have a base table that holds a list of users (information like username, id, date of birth). This table gets modified once every 12 hours. I want to run a pipeline that takes this table as input and finds the total number of unique users. Instead of running a batch pipeline over the entire base table every 12 hours, I thought of a streaming pipeline that constantly reads from the base table and applies a global window with a repeating trigger that fires every 12 hours to output the unique-user count, roughly like the sketch below.
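Something like this is what I have in mind (a rough sketch; `users` stands for the unbounded PCollection of user ids I don't know how to read continuously, which is the whole question):

    import apache_beam as beam
    from apache_beam import window
    from apache_beam.transforms import trigger

    def count_unique_users(users):
        # `users` would be an unbounded PCollection of user ids read
        # from the base table -- the part I can't figure out how to build.
        return (
            users
            | "GlobalWindow" >> beam.WindowInto(
                window.GlobalWindows(),
                trigger=trigger.Repeatedly(
                    trigger.AfterProcessingTime(12 * 60 * 60)),  # fire every 12 hrs
                accumulation_mode=trigger.AccumulationMode.DISCARDING)
            | "Dedup" >> beam.Distinct()
            | "Count" >> beam.CombineGlobally(
                beam.combiners.CountCombineFn()).without_defaults())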

Upvotes: 1

Views: 602

Answers (1)

no-stale-reads

Reputation: 358

Unfortunately, as of today you can't stream from a BigQuery table natively using the Beam SDK. When the connectors page says that BigQueryIO supports streaming, it means as a sink (writing data), not as a source (reading it). I ran into this before (the Beam SDK docs are not very clear on this point).

You can, however, use a couple of tricks to achieve the same behaviour. One of them is to use PeriodicImpulse, as in the sketch below.
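A rough sketch of that approach (the project, dataset, and table names in the query are placeholders, and the query itself is up to you):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    from apache_beam.transforms.periodicsequence import PeriodicImpulse

    class FetchUniqueUsers(beam.DoFn):
        def setup(self):
            # one BigQuery client per worker
            from google.cloud import bigquery
            self._client = bigquery.Client()

        def process(self, element):
            # placeholder table name; runs a fresh query on every impulse
            query = ("SELECT COUNT(DISTINCT id) AS unique_users "
                     "FROM `my-project.my_dataset.users`")
            for row in self._client.query(query).result():
                yield row.unique_users

    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        (p
         | "Every12h" >> PeriodicImpulse(fire_interval=12 * 60 * 60)
         | "QueryBigQuery" >> beam.ParDo(FetchUniqueUsers())
         | "Print" >> beam.Map(print))

Note that this pushes the COUNT(DISTINCT) work down to BigQuery on every firing, so Beam is only orchestrating the schedule.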

The second option is, as mentioned, to use a Pub/Sub message to start a refresh and then use a regular DoFn to fetch the data from BigQuery.
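Same idea, but triggered by a Pub/Sub message instead of a timer (the subscription path is a placeholder, and the DoFn is the same as in the previous sketch):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    class FetchUniqueUsers(beam.DoFn):
        # same DoFn as in the PeriodicImpulse sketch above
        def setup(self):
            from google.cloud import bigquery
            self._client = bigquery.Client()

        def process(self, message):
            query = ("SELECT COUNT(DISTINCT id) AS unique_users "
                     "FROM `my-project.my_dataset.users`")
            for row in self._client.query(query).result():
                yield row.unique_users

    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        (p
         # each published message acts as a "refresh now" signal
         | "Trigger" >> beam.io.ReadFromPubSub(
             subscription="projects/my-project/subscriptions/refresh-users")
         | "QueryBigQuery" >> beam.ParDo(FetchUniqueUsers())
         | "Print" >> beam.Map(print))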

In both cases, keep in mind that you can only read appends (new rows added to the table).

Which one fits best depends on your scenario, though I tend to go with the first option.

Upvotes: 0
