Reputation: 162
I want to read incoming data from a Google Pub/Sub topic, process and transform it into a unified data structure, and then insert it into a dataset in Google BigQuery. From what I understand, it is possible to use some kind of pipeline that streams the data, but I'm having trouble finding any good, concise examples that achieve this.
My project is written in Scala, so I would prefer examples written in that language. Otherwise something concise in Java works too.
Thanks!
Upvotes: 0
Views: 565
Reputation: 8178
I would say Google Cloud Dataflow is the correct product for your use case. It is used precisely for what you described: read input data from different sources (Pub/Sub in your case), transform it, and write it to a sink (BigQuery here).
Dataflow supports both batch and streaming pipelines. In a batch pipeline, all the data is available at creation time; a streaming pipeline is what you need here, since it continuously reads from an unbounded source (a Pub/Sub subscription, for example) and processes each element as soon as it arrives in the pipeline.
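For illustration, here is a minimal streaming pipeline sketch using the Apache Beam Java SDK. This is not your exact code: the project, subscription, table, and schema names are placeholders you would replace with your own, and the transform just wraps the raw message.

    import com.google.api.services.bigquery.model.TableFieldSchema;
    import com.google.api.services.bigquery.model.TableRow;
    import com.google.api.services.bigquery.model.TableSchema;
    import java.util.Arrays;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.transforms.SimpleFunction;

    public class PubSubToBigQuery {
      public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline pipeline = Pipeline.create(options);

        // Placeholder schema: a single STRING column for the message payload.
        TableSchema schema = new TableSchema().setFields(Arrays.asList(
            new TableFieldSchema().setName("payload").setType("STRING")));

        pipeline
            // Continuously read messages from an unbounded source.
            .apply("ReadFromPubSub",
                PubsubIO.readStrings().fromSubscription(
                    "projects/my-project/subscriptions/my-subscription"))
            // Transform each message into your unified data structure here.
            .apply("ToTableRow", MapElements.via(new SimpleFunction<String, TableRow>() {
              @Override
              public TableRow apply(String message) {
                return new TableRow().set("payload", message);
              }
            }))
            // Stream the resulting rows into BigQuery.
            .apply("WriteToBigQuery", BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.my_table")
                .withSchema(schema)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

        pipeline.run();
      }
    }

To run it continuously on Google Cloud, pass --runner=DataflowRunner --streaming when launching the pipeline.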
In addition, you may find it useful that the Dataflow team has recently released a beta version of some templates that make it easier to start working with Dataflow. In this case there is even a Cloud Pub/Sub to BigQuery template available, which you can use as-is, or whose source code (available in the official GitHub repository) you can modify to add any transformation you want to apply between the Pub/Sub read and the BigQuery write.
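If the template fits your use case as-is, one way to launch it is through the gcloud CLI. This is a sketch based on the template's documented parameters; the job name, topic, and table below are placeholders:

    gcloud dataflow jobs run my-pubsub-to-bq-job \
        --gcs-location gs://dataflow-templates/latest/PubSub_to_BigQuery \
        --parameters inputTopic=projects/my-project/topics/my-topic,outputTableSpec=my-project:my_dataset.my_table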
Note that the latest Dataflow Java SDK is based on Apache Beam, which has plenty of documentation and code references that you may find interesting.
Upvotes: 4