Reputation: 49
I want to run a streaming pipeline that reads continuously from a Google BigQuery table. Right now my streaming pipeline stops after reading from the BigQuery table once. The Apache Beam documentation doesn't seem to mention this.
Please help
The design is:
I have a base table that holds a list of users (information like username, id, date of birth). This table gets modified once every 12 hours. I want to run a pipeline that takes this table as input and finds the total number of unique users. So instead of running a batch pipeline on the entire base table every 12 hours, I thought of a streaming pipeline that constantly reads from the base table and uses a global window with a repeating trigger that fires every 12 hours to output the unique user count. Roughly what I have so far is below (simplified; the project, dataset, and field names are placeholders):
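```python
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms import trigger

with beam.Pipeline() as p:
    (p
     # ReadFromBigQuery is a bounded read: it runs once and finishes,
     # which is why the pipeline stops after the first read.
     | 'Read' >> beam.io.ReadFromBigQuery(table='my-project:my_dataset.users')
     | 'GlobalWindow' >> beam.WindowInto(
         window.GlobalWindows(),
         trigger=trigger.Repeatedly(trigger.AfterProcessingTime(12 * 60 * 60)),
         accumulation_mode=trigger.AccumulationMode.DISCARDING)
     | 'ExtractId' >> beam.Map(lambda row: row['id'])
     | 'Distinct' >> beam.Distinct()
     | 'CountUnique' >> beam.combiners.Count.Globally().without_defaults()
     | 'Print' >> beam.Map(print))
```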
Upvotes: 1
Views: 602
Reputation: 358
Unfortunately, as of today, you can't stream from a BigQuery table natively (using the Beam SDK). When the connectors page says that BigQueryIO is compatible with streaming, it means as a sink for your data, not as a source. I ran into this before (the Beam SDK docs are not very clear on this).
You can, however, use some tricks to achieve the same behaviour. One of them is to use PeriodicImpulse (sketch below).
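A minimal sketch of the PeriodicImpulse approach, assuming the Python SDK; the table name is a placeholder and the DoFn is a hypothetical helper that re-queries the table on every impulse:

```python
import apache_beam as beam
from apache_beam.transforms.periodicsequence import PeriodicImpulse
from apache_beam.utils.timestamp import MAX_TIMESTAMP, Timestamp

class FetchUsersFromBQ(beam.DoFn):
    """Re-queries the base table on every impulse (table name is a placeholder)."""
    def setup(self):
        from google.cloud import bigquery
        self._client = bigquery.Client()

    def process(self, unused_impulse):
        query = 'SELECT id FROM `my-project.my_dataset.users`'
        for row in self._client.query(query).result():
            yield row['id']

with beam.Pipeline() as p:
    (p
     # Fire one impulse every 12 hours, forever; apply_windowing=True puts
     # each impulse (and everything derived from it) into its own
     # fixed window of the same length, so the count below is per refresh.
     | 'Every12h' >> PeriodicImpulse(
         start_timestamp=Timestamp.now(),
         stop_timestamp=MAX_TIMESTAMP,
         fire_interval=12 * 60 * 60,
         apply_windowing=True)
     | 'Fetch' >> beam.ParDo(FetchUsersFromBQ())
     | 'Distinct' >> beam.Distinct()
     | 'CountUnique' >> beam.combiners.Count.Globally().without_defaults()
     | 'Print' >> beam.Map(print))
```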
The second option is, as mentioned before, to use Pub/Sub to start the pipeline and then use a regular DoFn to fetch the data from BigQuery (see the second sketch below).
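And a sketch of the Pub/Sub option, assuming something like Cloud Scheduler publishes a message every 12 hours; the subscription path and table name are placeholders:

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.options.pipeline_options import PipelineOptions

class FetchUsersFromBQ(beam.DoFn):
    """Each Pub/Sub message triggers a fresh query (table name is a placeholder)."""
    def setup(self):
        from google.cloud import bigquery
        self._client = bigquery.Client()

    def process(self, unused_message):
        query = 'SELECT id FROM `my-project.my_dataset.users`'
        for row in self._client.query(query).result():
            yield row['id']

with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    (p
     # The message payload is ignored; it only acts as a trigger.
     | 'Trigger' >> beam.io.ReadFromPubSub(
         subscription='projects/my-project/subscriptions/refresh-users')
     | 'Fetch' >> beam.ParDo(FetchUsersFromBQ())
     # Group each refresh into its own 12h window before aggregating.
     | 'Window' >> beam.WindowInto(window.FixedWindows(12 * 60 * 60))
     | 'Distinct' >> beam.Distinct()
     | 'CountUnique' >> beam.combiners.Count.Globally().without_defaults()
     | 'Print' >> beam.Map(print))
```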
In both cases, you can read only APPENDS (newly added rows), not updates to existing ones.
It depends on your scenario. I tend to go with the first option, though.
Upvotes: 0