user1615898

Reputation: 1235

How to migrate local data workflow to Google Cloud?

We have a Python data pipeline that runs on our server. It grabs data from various sources, aggregates it and writes the results to SQLite databases. The daily runtime is only about one hour and network traffic is maybe 100 MB at most. What are our options for migrating this to Google Cloud? We would like more reliable scheduling, a cloud database, better data analytics options (powerful dashboards and visualization) and easy development. Should we go serverless or use a server? Is such low usage covered by the free tier?

Upvotes: 0

Views: 225

Answers (4)

Adrian

Reputation: 2113

Within Google Cloud Platform, BigQuery is a great serverless choice - you can start small and grow over time.

With partitioning, clustering and other optimizations, we've been successfully using it behind a UI (4-8k queries per day), with most queries completing in under a second.
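
As a rough illustration (not from the answer itself), a date-partitioned and clustered table can be created with the official BigQuery Python client along these lines; the project, dataset, table and column names are placeholders:

    # A minimal sketch of creating a date-partitioned, clustered BigQuery table
    # with the official Python client. Project, dataset, table and column names
    # are placeholders.
    from google.cloud import bigquery

    client = bigquery.Client()  # picks up your default project and credentials

    table = bigquery.Table(
        "my-project.analytics.events",
        schema=[
            bigquery.SchemaField("event_date", "DATE"),
            bigquery.SchemaField("user_id", "STRING"),
            bigquery.SchemaField("value", "FLOAT64"),
        ],
    )
    # Partition by the date column and cluster by user_id so interactive queries
    # stay fast and cheap as the table grows.
    table.time_partitioning = bigquery.TimePartitioning(field="event_date")
    table.clustering_fields = ["user_id"]

    client.create_table(table)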

You can also get data from various sources (millions of files per day) seamlessly ingested into one or many tables with BqTail.

Upvotes: 0

guillaume blaquiere

Reputation: 75715

There are several requirements to take care of before migrating, for example: are all your data sources reachable from a cloud platform?

About storage and analytics, BigQuery is an amazing product and works very well with denormalized data. Can your data be denormalized? Does your job require transactional capabilities?

Does your data need to be queried from a website? BigQuery is powerful for analytics, but there is about 1 s of query warm-up, which is not acceptable on a website. It's not like Cloud SQL (MySQL or PostgreSQL), whose response time is in milliseconds but which is limited to a few TB (and getting good response times with TBs in Cloud SQL is a challenge!).

If it's only for dashboarding, you can use Data Studio; it's free, and you can cache your BigQuery data with BI Engine for more responsive dashboards.

If all of these requirements work for you, and if your data sources are publicly accessible on the internet (I mean no VPN is required to reach them), I can propose a fully serverless solution. This solution is a side use of a Google Cloud service, and it works well!

I wrote an article on a similar use case that you can take inspiration from. Cloud Build lets you run CI pipelines, and you can use a custom builder: a container that you build yourself and run on Cloud Build.

So, the approach is:

  1. Package your current workflow in a container compliant with Cloud Build, and write your Cloud Build job (don't forget to set an appropriate timeout value)
  2. Create a Cloud Function or a Cloud Run service (if you prefer containers) that runs Cloud Build, optionally with some substitution variables for customizing each run (see the sketch after this list)
  3. Set up Cloud Scheduler to trigger your Cloud Run service or Cloud Function every day
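
As a rough sketch of step 2, a small HTTP-triggered Cloud Function (Python) can start the Cloud Build run; the project ID and builder image name are placeholders, and Cloud Scheduler then simply calls the function's URL once a day:

    # A minimal sketch of an HTTP-triggered Cloud Function that starts a Cloud
    # Build run of the packaged workflow. Project ID and image name are placeholders.
    from google.cloud.devtools import cloudbuild_v1

    PROJECT_ID = "my-project"                        # assumption: your GCP project
    BUILDER_IMAGE = "gcr.io/my-project/my-workflow"  # assumption: your pipeline container

    def trigger_workflow(request):
        """Entry point called once a day by Cloud Scheduler over HTTP."""
        client = cloudbuild_v1.CloudBuildClient()
        build = cloudbuild_v1.Build(
            steps=[{"name": BUILDER_IMAGE}],  # run the container as a single build step
        )
        operation = client.create_build(project_id=PROJECT_ID, build=build)
        return f"Started build {operation.metadata.build.id}"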

Aside from the BigQuery cost, this pattern costs 0! You have 120 free build-minutes per day (per billing account) with Cloud Build, Cloud Scheduler is free (up to 3 schedulers per billing account), and Cloud Functions/Cloud Run have a huge free tier (here they only run for a few milliseconds).

Streaming to BigQuery is not free but affordable: about half a cent for 100 MB!

Note: Cloud Run will one day offer long-running jobs. You will then be able to reuse your Cloud Build container on Cloud Run when that feature is released. For now, I propose this as a workaround.

Upvotes: 1

Parth Mehta

Reputation: 1917

GCP on a shoestring budget: Google gives you $300 to spend in the first 12 months, and there are some services which give you free usage per month: https://cloud.google.com/free/docs/gcp-free-tier

For example:

You can use BigQuery free of charge for 1 TB of querying per month and 10 GB of storage each month.

Here's an excellent video on making the most out of the GCP free tiers: https://www.youtube.com/watch?v=N2OG1w6bGFo&t=818s

Approach to migration:

When moving to cloud you typically choose from one of the following approaches:

1) Rehost (lift-and-shift) - no modification to code or architecture

2) Replatform - with minor modifications to code

3) Refactor - with modifications to code and architecture

Obviously you'll get the most cloud benefits (i.e. performance and cost efficiency) with option 3, but it will take longer, whereas option 1 is quicker with the least amount of benefit.

You can use Cloud Composer for scheduling, which is effectively a managed Apache Airflow service. It will allow you to manage batch, streaming and scheduled tasks.
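
Since Cloud Composer runs standard Apache Airflow, the existing daily job could be expressed as a DAG roughly like this (DAG, task and function names are illustrative only):

    # A rough sketch of the daily pipeline as an Airflow DAG on Cloud Composer.
    # Function, DAG and task names are illustrative only.
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract_and_aggregate(**context):
        # Placeholder for the existing Python logic that pulls from the various
        # sources, aggregates the data and loads the result into BigQuery.
        pass

    with DAG(
        dag_id="daily_data_pipeline",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",   # replaces the local scheduling
        catchup=False,
        default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
    ) as dag:
        PythonOperator(
            task_id="extract_and_aggregate",
            python_callable=extract_and_aggregate,
        )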

Visualisation can be done through Google Data Studio, which can use BigQuery as a data source. Data Studio is free, but the underlying BigQuery queries will be chargeable.

BigQuery for data-analytics.

For the database, you can migrate to managed Cloud SQL, which supports PostgreSQL and MySQL database types.
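
From the pipeline's point of view, swapping SQLite for Cloud SQL is mostly a connection change; a minimal sketch with SQLAlchemy against a Cloud SQL PostgreSQL instance could look like this (host, credentials, database and table names are placeholders):

    # A minimal sketch of writing to Cloud SQL (PostgreSQL) instead of SQLite.
    # Host, credentials, database and table names are placeholders.
    import pandas as pd
    from sqlalchemy import create_engine

    # Public-IP connection shown for brevity; in practice the Cloud SQL Auth Proxy
    # or the Cloud SQL Python Connector is the recommended way to connect.
    engine = create_engine(
        "postgresql+psycopg2://pipeline_user:secret@203.0.113.10:5432/analytics"
    )

    df = pd.DataFrame({"day": ["2021-01-01"], "rows_processed": [12345]})
    df.to_sql("daily_summary", engine, if_exists="append", index=False)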

Typically the serverless approach is likely to be the most cost effective if you can accommodate it, which will obviously fall under option 3) refactor.

Upvotes: 2

LeandroHumb

Reputation: 873

For a lift-and-shift option, you can run your Python workload on Google Compute Engine, which is a virtual machine, but to make the best use of Google Cloud, I suggest you:

  • Spin up a Google Compute Engine instance
  • Run your Python workload
  • Save your data in Google BigQuery (see the sketch below this list)
  • Shut down your VM
  • Schedule it using Cloud Scheduler
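
For the "save your data in BigQuery" step, the local SQLite writes could be replaced with a load job run from the VM; a minimal sketch with the BigQuery Python client and pandas (dataset and table names are placeholders):

    # A minimal sketch of loading the aggregated results into BigQuery from the VM,
    # replacing the local SQLite write. Dataset and table names are placeholders.
    import pandas as pd
    from google.cloud import bigquery

    client = bigquery.Client()

    df = pd.DataFrame(
        {"event_date": ["2021-01-01"], "user_id": ["abc"], "value": [1.0]}
    )

    job = client.load_table_from_dataframe(df, "my-project.analytics.events")
    job.result()  # wait for the load to finish before shutting the VM down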

Here is a tutorial from Google on how to do it: https://cloud.google.com/scheduler/docs/start-and-stop-compute-engine-instances-on-a-schedule

Upvotes: 2
