Reputation: 1235
We have a Python data pipeline that runs on our server. It grabs data from various sources, aggregates it, and writes the results to SQLite databases. The daily runtime is just 1 hour and network traffic is maybe 100 MB at most. What are our options for migrating this to Google Cloud? We would like more reliable scheduling, a cloud database, and better analytics options for the data (powerful dashboards and visualization), plus easy development. Should we go serverless or use a server? Is the pricing free for such low usage?
Upvotes: 0
Views: 225
Reputation: 2113
On Google Cloud Platform, BigQuery is a great serverless choice - you can start small and grow over time.
With partitioning, clustering and other improvements, we've been successfully using it behind a UI (4-8k queries per day), with most queries completing in under a second.
You can also seamlessly ingest data from various sources - millions of files per day - into one or many tables with BqTail.
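The partitioning and clustering mentioned above are set in the table's DDL. A minimal sketch of composing such a statement as a SQL string - the dataset, table and column names (`mydataset.events`, `event_ts`, `source`) are made up for illustration, not taken from the answer:

```python
# Build a BigQuery CREATE TABLE statement with daily partitioning and
# clustering, the two features credited above for fast dashboard queries.
def partitioned_table_ddl(table: str, partition_col: str, cluster_cols: list) -> str:
    """Compose a DDL string with PARTITION BY and CLUSTER BY clauses."""
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (\n"
        f"  event_ts TIMESTAMP,\n"
        f"  source STRING,\n"
        f"  value FLOAT64\n"
        f")\n"
        f"PARTITION BY DATE({partition_col})\n"
        f"CLUSTER BY {', '.join(cluster_cols)}"
    )

ddl = partitioned_table_ddl("mydataset.events", "event_ts", ["source"])
print(ddl)
```

You would run the resulting statement through the BigQuery console or client library; queries that filter on the partition column then scan only the matching partitions.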
Upvotes: 0
Reputation: 75715
There are several requirements to consider before migrating, such as: are all your data sources reachable from a cloud platform?
For storage and analytics, BigQuery is an amazing product and works very well with denormalized data. Can your data be denormalized? Does your job require transactional capabilities?
Does your data need to be queried from a website? BigQuery is powerful for analytics, but there is about 1 s of query warm-up, which is not acceptable on a website. That's unlike Cloud SQL (MySQL or PostgreSQL), whose response times are in milliseconds, but which is limited to a few TB (and getting good response times with TBs in Cloud SQL is a challenge!).
If it's only for dashboarding, you can use Data Studio: it's free, and you can cache your BigQuery data with BI Engine for more responsive dashboards.
If all of these requirements work for you, and if your data sources are publicly accessible on the internet (I mean no VPN is required to reach them), I can propose a fully serverless solution. This solution is an unconventional use of a Google Cloud service, and it works well!
I wrote an article on a similar use case that you can take inspiration from. Cloud Build lets you run CI pipelines, and you can use a custom builder: a container that you build yourself and that you can run on Cloud Build.
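To make the custom-builder pattern concrete, a Cloud Build job is driven by a `cloudbuild.yaml` file. A minimal sketch - the image name `pipeline-runner` and the script `run_pipeline.py` are hypothetical placeholders for your own container and entrypoint:

```yaml
# cloudbuild.yaml: run the pipeline container as a single build step.
steps:
  - name: 'gcr.io/$PROJECT_ID/pipeline-runner'   # your custom builder image
    args: ['python', 'run_pipeline.py']
timeout: '3600s'   # allow up to the job's ~1 hour runtime
```

Cloud Scheduler then triggers this build (via a small Cloud Function or Cloud Run endpoint calling the Cloud Build API) on your daily schedule.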
By the way, aside from BigQuery costs, this pattern costs $0! You get 120 free build-minutes per day (per billing account) with Cloud Build, Cloud Scheduler is free (up to 3 schedulers per billing account), and Cloud Functions/Cloud Run have a huge free tier (here they only run for a few milliseconds).
Streaming to BigQuery is not free, but it is affordable: half a cent per 100 MB!
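A quick back-of-envelope check of that half-a-cent claim, assuming streaming inserts are billed at $0.010 per 200 MB (verify against current BigQuery pricing - the rate here is an assumption for illustration):

```python
# Estimate BigQuery streaming-insert cost for the pipeline's daily volume.
PRICE_PER_200_MB_USD = 0.010  # assumed streaming-insert rate

def streaming_cost_usd(megabytes: float) -> float:
    """Cost of streaming `megabytes` of data into BigQuery at the assumed rate."""
    return (megabytes / 200.0) * PRICE_PER_200_MB_USD

daily_cost = streaming_cost_usd(100)   # the question's ~100 MB per day
monthly_cost = daily_cost * 30
print(f"daily: ${daily_cost:.4f}, monthly: ${monthly_cost:.2f}")
```

At 100 MB/day that works out to $0.005 per day, roughly 15 cents a month.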
Note: one day, Cloud Run will offer long-running jobs. You will then be able to reuse your Cloud Build container in Cloud Run when that feature is released. For today, I propose a workaround for this.
Upvotes: 1
Reputation: 1917
GCP on a shoestring budget: Google gives you $300 to spend during your first 12 months, and some services give you a free usage allowance every month: https://cloud.google.com/free/docs/gcp-free-tier
For example:
You can use BigQuery free of charge for 1 TB of querying and 10 GB of storage per month.
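To see how much headroom the question's workload has against that 1 TB/month query allowance, a rough arithmetic sketch (assuming each day's ~100 MB is scanned once - real scan volume depends on your queries):

```python
# Compare the pipeline's monthly scan volume against the BigQuery free tier.
FREE_QUERY_TB_PER_MONTH = 1.0  # free-tier querying allowance cited above

daily_scan_gb = 0.1                      # ~100 MB/day from the question
monthly_scan_tb = daily_scan_gb * 31 / 1024

print(f"monthly scan: {monthly_scan_tb:.5f} TB")
print("within free tier:", monthly_scan_tb <= FREE_QUERY_TB_PER_MONTH)
```

That is about 0.003 TB a month - three orders of magnitude under the free allowance, so at this volume querying should cost nothing.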
Here's an excellent video on making the most of the GCP free tiers: https://www.youtube.com/watch?v=N2OG1w6bGFo&t=818s
Approach to migration:
When moving to cloud you typically choose from one of the following approaches:
1) Rehost (lift-and-shift) - no modifications to code or architecture
2) Replatform - with minor modifications to code
3) Refactor - with modifications to code and architecture
Obviously you'll get the most cloud benefits (i.e. performance and cost efficiency) with option 3, but it will take longer, whereas option 1 is quicker with the least amount of benefits.
You can use Cloud Composer for scheduling, which is effectively a managed Apache Airflow service. It will allow you to manage batch and streaming workloads and schedule tasks.
Visualisation can be done through Google Data Studio, which can use BigQuery as a data source. Data Studio is free, but the underlying BigQuery queries will be chargeable.
BigQuery for data-analytics.
For the database, you can migrate to managed Cloud SQL, which supports PostgreSQL and MySQL database types.
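A common first step when moving the pipeline's SQLite output into Cloud SQL is dumping each table to CSV, which the Cloud SQL import tools accept. A stdlib-only sketch - the `metrics` table here is a made-up stand-in for your real schema:

```python
# Dump a SQLite table to CSV, header row included, ready for Cloud SQL import.
import csv
import io
import sqlite3

def table_to_csv(conn: sqlite3.Connection, table: str) -> str:
    """Return the full contents of one SQLite table as a CSV string."""
    cur = conn.execute(f"SELECT * FROM {table}")
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow([col[0] for col in cur.description])  # header from cursor metadata
    writer.writerows(cur.fetchall())
    return buf.getvalue()

# Demo on an in-memory database standing in for the pipeline's SQLite file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (day TEXT, value REAL)")
conn.execute("INSERT INTO metrics VALUES ('2020-01-01', 42.0)")
dump = table_to_csv(conn, "metrics")
print(dump)
```

In practice you would write each table's CSV to a Cloud Storage bucket and then run the Cloud SQL import from there.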
Typically, serverless options are likely to be more cost-effective if you can accommodate them, which will obviously fall under option 3) refactor.
Upvotes: 2
Reputation: 873
For a lift-and-shift option, you can run your Python workload on Google Compute Engine, which is a virtual machine. To make the best use of Google Cloud, I suggest starting and stopping the instance on a schedule so you only pay while the job runs.
Here is a tutorial from Google on how to do it: https://cloud.google.com/scheduler/docs/start-and-stop-compute-engine-instances-on-a-schedule
Upvotes: 2