Alex Turdean

Reputation: 69

GCP - Creating a Dataflow (Pub/Sub -> prediction(ML model) -> BigQuery/Firebase)

I am new to GCP and I want to create a dataflow for my project. Long story short, my devices send data to Pub/Sub and after that, I want to make a prediction using an ML model and then output all of this to BigQuery and a realtime Firebase database. I found this article from Google (I looked at "Stream + Micro-batching" but failed to implement it) and this GitHub repository, but I really don't know how to run it. If anyone can give me a hand I would be really grateful.

Would it be easier to implement all of this with Cloud Functions?

Upvotes: 0

Views: 482

Answers (1)

guillaume blaquiere

Reputation: 75715

There are several ways to address your use case.

First of all, I'm not sure that Dataflow is required. Dataflow is perfect for data transformation or data comparison, as described in the article, but I'm not sure that is your use case. If so, here are several proposals (we could dig into one of them if you want):

  • The cheapest option is not scalable: set up a pull subscription on your Pub/Sub topic. Then set up a Cloud Scheduler job that calls an HTTP service (like Cloud Functions or Cloud Run) every 2 minutes. The HTTP service pulls the subscription. For each message, it performs a prediction and stores the result in memory (in an array). When all the messages have been processed, you perform a load job into BigQuery (or a batch insert into Datastore).

This solution is the cheapest because you process the messages in micro-batches (more efficient in processing time) and you perform a load job into BigQuery (which is free, compared to streaming inserts). However, it's not scalable because you keep your data in memory before triggering the load job. If you have more and more data, you can reach the 2GB memory limit of Cloud Run or Cloud Functions. Increasing the scheduler frequency is not an option either, because you have a quota of 1,000 load jobs per day (1 day = 1440 minutes, so running every minute is not possible).
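A minimal sketch of this first option, assuming an HTTP-triggered Cloud Function (Python runtime) that Cloud Scheduler calls every couple of minutes; the project, subscription and table names as well as predict() are placeholders for your own project and model:

```python
import json

from google.cloud import bigquery, pubsub_v1

PROJECT_ID = "my-project"                 # placeholder
SUBSCRIPTION = "device-data-sub"          # pull subscription on your topic
TABLE_ID = "my-project.iot.predictions"   # destination BigQuery table

subscriber = pubsub_v1.SubscriberClient()
bq_client = bigquery.Client()
sub_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION)


def predict(payload: dict) -> float:
    """Placeholder for your ML model call (local model, AI Platform, etc.)."""
    return 0.0


def micro_batch(request):
    """HTTP entry point called by Cloud Scheduler."""
    # Pull a batch of messages from the subscription.
    response = subscriber.pull(
        request={"subscription": sub_path, "max_messages": 500}
    )
    if not response.received_messages:
        return "no messages", 200

    rows, ack_ids = [], []
    for received in response.received_messages:
        payload = json.loads(received.message.data.decode("utf-8"))
        payload["prediction"] = predict(payload)   # keep results in memory
        rows.append(payload)
        ack_ids.append(received.ack_id)

    # One free load job for the whole batch (instead of paid streaming inserts).
    job = bq_client.load_table_from_json(
        rows, TABLE_ID, job_config=bigquery.LoadJobConfig(autodetect=True)
    )
    job.result()  # wait for the load job to complete

    # Ack only after the data is safely in BigQuery.
    subscriber.acknowledge(request={"subscription": sub_path, "ack_ids": ack_ids})
    return f"loaded {len(rows)} rows", 200
```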

  • The easiest option is the most expensive: plug an HTTP service into your Pub/Sub topic (Cloud Run or Cloud Functions; Cloud Run works only with push subscriptions, Cloud Functions works with both pull and push subscriptions). For each message, the HTTP service is called, performs a prediction, and then stream-writes the result to BigQuery.

This solution is highly scalable, and the most expensive one. I recommend Cloud Run, which allows you to process several messages concurrently and thus decrease the billable instance processing time. (I wrote an article on this.)
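A minimal sketch of this second option, assuming a Cloud Run service behind a Pub/Sub push subscription; the Flask framework choice, the table name and predict() are assumptions, while the push message envelope format is the standard one:

```python
import base64
import json
import os

from flask import Flask, request
from google.cloud import bigquery

app = Flask(__name__)
bq_client = bigquery.Client()
TABLE_ID = "my-project.iot.predictions"   # placeholder


def predict(payload: dict) -> float:
    """Placeholder for your ML model call."""
    return 0.0


@app.route("/", methods=["POST"])
def handle_push():
    # Pub/Sub push wraps the message in an envelope: {"message": {"data": <base64>}}
    envelope = request.get_json()
    data = base64.b64decode(envelope["message"]["data"]).decode("utf-8")
    payload = json.loads(data)

    payload["prediction"] = predict(payload)

    # Streaming insert: immediately available, but billed per inserted row.
    errors = bq_client.insert_rows_json(TABLE_ID, [payload])
    if errors:
        # A non-2xx response makes Pub/Sub redeliver the message.
        return f"insert errors: {errors}", 500

    # A 2xx response acknowledges the pushed message.
    return "ok", 204


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))
```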

Finally, the best option is a mix of both if you don't have to process the messages as soon as possible: schedule a micro-batch that pulls the Pub/Sub pull subscription. For each message, perform a prediction and stream-write the result to BigQuery (to prevent memory overflow).
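A minimal sketch of this mixed option, reusing the same placeholder names and predict() stub as the first sketch: the scheduled function pulls the subscription in small chunks and streams each chunk to BigQuery right away instead of holding the whole batch in memory.

```python
import json

from google.cloud import bigquery, pubsub_v1

PROJECT_ID = "my-project"                 # placeholder
SUBSCRIPTION = "device-data-sub"          # pull subscription
TABLE_ID = "my-project.iot.predictions"   # placeholder

subscriber = pubsub_v1.SubscriberClient()
bq_client = bigquery.Client()
sub_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION)


def predict(payload: dict) -> float:
    """Placeholder for your ML model call."""
    return 0.0


def micro_batch_streaming(request):
    """Scheduled HTTP entry point: pull in chunks, stream-insert each chunk."""
    total = 0
    while True:
        response = subscriber.pull(
            request={"subscription": sub_path, "max_messages": 100}
        )
        if not response.received_messages:
            break

        rows, ack_ids = [], []
        for received in response.received_messages:
            payload = json.loads(received.message.data.decode("utf-8"))
            payload["prediction"] = predict(payload)
            rows.append(payload)
            ack_ids.append(received.ack_id)

        # Stream this chunk immediately so memory usage stays bounded.
        bq_client.insert_rows_json(TABLE_ID, rows)
        subscriber.acknowledge(
            request={"subscription": sub_path, "ack_ids": ack_ids}
        )
        total += len(rows)

    return f"streamed {total} rows", 200
```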

If you really need to use Dataflow in your process, please describe in more detail what you want to achieve, so I can give you better advice.

In any case, I agree with the comment of @JohnHanley: go through the Qwiklabs to get an idea of what you can do with the platform!

Upvotes: 3
