Reputation: 69
We are deploying a data consortium between more than 10 companies. Wi will deploy several machine learning models (in general advanced analytics models) for all the companies and we will administrate all the models. We are looking for a solution that administrates several servers, clusters and data science pipelines. I love kedro, but not sure what is the best option to administrate all while using kedro.
In summary, we are looking for the best solution to administrate several models, tasks and pipelines in different servers and possibly Spark clusters. Our current options are:
AWS as our data warehouse and Databricks for administrating servers, clusters and tasks. I don't feel that the notebooks of databricks are a good solution for building pipelines and to work collaboratively, so I would like to connect kedro to databricks (is it good? is it easy to schedule the run of the kedro pipelines using databricks?)
Using GCP for data warehouse and use kubeflow (iin GCP) for deploying models and the administration and the schedule of the pipelines and the needed resources
Setting up servers from ASW or GCP, install kedro and schedule the pipelines with airflow (I see a big problem administrating 20 servers and 40 pipelines)
I would like to know if someone knows what is the best option between these alternatives, their downsides and advantages, or if there are more possibilities.
Upvotes: 5
Views: 1616
Reputation: 3101
I'll try and summarise what I know, but be aware that I've not been part of a KubeFlow project.
Our approach was to build our project with CI and then execute the pipeline from a notebook. We did not use the kedro recommended approach of using databricks-connect due to the large price difference between Jobs and Interactive Clusters (which are needed for DB-connect). If you're working on several TB's of data, this quickly becomes relevant.
As a DS, this approach may feel natural, as a SWE though it does not. Running pipelines in notebooks feels hacky. It works but it feels non-industrialised. Databricks performs well in automatically spinning up and down clusters & taking care of the runtime for you. So their value add is abstracting IaaS away from you (more on that later).
Pro: GCP's main selling point is BigQuery. It is an incredibly powerful platform, simply because you can be productive from day 0. I've seen people build entire web API's on top of it. KubeFlow isn't tied to GCP so you could port this somewhere else later on. Kubernetes will also allow you to run anything else you wish on the cluster, API's, streaming, web services, websites, you name it.
Con: Kubernetes is complex. If you have 10+ engineers to run this project long-term, you should be OK. But don't underestimate the complexity of Kubernetes. It is to the cloud what Linux is to the OS world. Think log management, noisy neighbours (one cluster for web APIs + batch spark jobs), multi-cluster management (one cluster per department/project), security, resource access etc.
Your last alternative, the manual installation of servers is one I would recommend only if you have a large team, extremely large data and are building a long-term product who's revenue can sustain the large maintenance costs.
How does the talent market look like in your region? If you can hire experienced engineers with GCP knowledge, I'd go for the 2nd solution. GCP is a mature, "native" platform in the sense that it abstracts a lot away for customers. If your market has mainly AWS engineers, that may be a better road to take. If you have a number of kedro engineers, that also has relevance. Note that kedro is agnostic enough to run anywhere. It's really just python code.
Subjective advise:
Having worked mostly on AWS projects and a few GCP projects, I'd go for GCP. I'd use the platform's components (BigQuery, Cloud Run, PubSub, Functions, K8S) as a toolbox to choose from and build an organisation around that. Kedro can run in any of these contexts, as a triggered job by the Scheduler, as a container on Kubernetes or as a ETL pipeline bringing data into (or out of) BigQuery.
While Databricks is "less management" than raw AWS, it's still servers to think about and VPC networking charges to worry over. BigQuery is simply GB queried. Functions are simply invocation count. These high level components will allow you to quickly show value to customers and you only need to go deeper (RaaS -> PaaS -> IaaS) as you scale.
AWS also has these higher level abstractions over IaaS but in general, it appears (to me) that Google's offering is the most mature. Mainly because they have published tools they've been using internally for almost a decade whereas AWS has built new tools for the market. AWS is the king of IaaS though.
Finally, a bit of content, two former colleagues have discussed ML industrialisation frameworks earlier this fall
Upvotes: 5