Reputation: 1211
Maybe I misunderstand the purpose of packaging, but it doesn't seem too helpful for creating an artifact for production deployment because it only packages code. It leaves out the conf, data, and other directories that make the Kedro project reproducible.
I understand that I can use the Docker or Airflow plugins for deployment, but what about deploying to Databricks? Do you have any advice here?
I was thinking about making a wheel that could be installed on the cluster but I would need to package the conf first. Another option is to just sync a git workspace to the cluster and run kedro via a notebook.
Any thoughts on a best practice?
Upvotes: 5
Views: 3031
Reputation: 324
Here is a more modern answer, as of 2024.
In the past there was some friction, mainly because Kedro is project-based while Databricks focuses heavily on notebooks. This has changed since the introduction of dbx and Databricks Asset Bundles, which let users work in an IDE while sending jobs to Databricks. There is now a kedro-databricks community plugin which helps you get started quickly with Databricks Asset Bundles: https://github.com/JenspederM/kedro-databricks.
On the topic of conf packaging: Kedro follows the "Twelve-Factor App" principle, so configuration is not shipped with the package. That said, you can always move conf into src so that it gets packaged; the only thing you need to change is your project's settings.py, where you simply update CONF_SOURCE to point at the desired directory.
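A minimal sketch of that settings.py change, assuming the package is named my_project and that conf was moved to src/my_project/conf (both names are placeholders):

# src/my_project/settings.py -- "my_project" is a placeholder package name
# Point Kedro at the configuration that now lives inside the package,
# so it is included when the project is built with `kedro package`.
CONF_SOURCE = "src/my_project/conf"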
The configuration is expected to be modified, which is why it is kept external to the package. You will most likely put the configuration on DBFS or similar remote storage. There is an example in our docs; you can find more detail here: https://docs.kedro.org/en/stable/deployment/databricks/databricks_deployment_workflow.html#upload-project-data-and-configuration-to-dbfs
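As a rough sketch of that pattern, a Databricks notebook can pick up configuration uploaded to DBFS like this, assuming a recent Kedro version where KedroSession.create accepts a conf_source argument; the paths below are placeholders:

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

# Placeholder locations for the project and its uploaded configuration
project_root = "/dbfs/FileStore/my_project"
conf_source = "/dbfs/FileStore/my_project/conf"

# Register the project's settings and pipelines, then run it against the DBFS config
bootstrap_project(project_root)
with KedroSession.create(project_path=project_root, conf_source=conf_source) as session:
    session.run()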
Upvotes: 0
Reputation: 1211
I found the best option was to just use another tool for packaging, deploying, and running the job. Using MLflow with Kedro seems like a good fit. I do almost everything in Kedro but use MLflow for the packaging and job execution: https://medium.com/@QuantumBlack/deploying-and-versioning-data-pipelines-at-scale-942b1d81b5f5
My MLproject file looks like this:
name: My Project
conda_env: conda.yaml
entry_points:
  main:
    command: "kedro install && kedro run"
Then running it with:
mlflow run -b databricks -c cluster.json . -P env="staging" --experiment-name /test/exp
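If you would rather trigger the same run from Python instead of the CLI, mlflow.projects.run is the programmatic equivalent (a sketch; cluster.json, the env value, and the experiment name are the same placeholders as in the CLI command above):

import mlflow

# Programmatic equivalent of:
#   mlflow run -b databricks -c cluster.json . -P env="staging" --experiment-name /test/exp
mlflow.projects.run(
    uri=".",                        # the Kedro project containing the MLproject file
    backend="databricks",
    backend_config="cluster.json",  # Databricks cluster spec, same file as -c
    parameters={"env": "staging"},  # forwarded to the entry point like -P
    experiment_name="/test/exp",
)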
Upvotes: 1
Reputation: 265
If you are not using Docker and are just using Kedro to deploy directly on a Databricks cluster, this is how we have been deploying Kedro to Databricks:
1. A CI/CD pipeline builds the project using kedro package, which creates a wheel file.
2. Upload dist and conf to DBFS, or copy them to Azure Blob storage (if using Azure Databricks). This uploads everything to Databricks on every git push.
3. Then you can have a notebook with the following:
# cargoai is this project's package name; run and create_pipeline come from
# the project code that was packaged and installed on the cluster
from cargoai import run
from cargoai.pipeline import create_pipeline

# The branch to run is passed to the notebook as a widget parameter
branch = dbutils.widgets.get("branch")

# Load the configuration that the CI/CD pipeline uploaded for this branch
conf = run.get_config(
    project_path=f"/dbfs/project_name/build/cicd/{branch}"
)
catalog = run.create_catalog(config=conf)
pipeline = create_pipeline()
Here conf, catalog, and pipeline will be available.
4. Call this init script when you want to run a branch or the master branch in production, like: %run "/Projects/InitialSetup/load_pipeline" $branch="master"
5. For development and testing, you can run specific nodes with pipeline = pipeline.only_nodes_with_tags(*tags), then run a full or a partial pipeline with just SequentialRunner().run(pipeline, catalog) (see the sketch after this list).
6. In production, this notebook can be scheduled by Databricks. If you are on Azure Databricks, you can use Azure Data Factory to schedule and run it.
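A minimal sketch of that development workflow, reusing the catalog and pipeline objects from the init notebook above; "preprocessing" is a hypothetical tag name:

from kedro.runner import SequentialRunner

# "preprocessing" is a hypothetical tag; use whatever tags your nodes declare
partial_pipeline = pipeline.only_nodes_with_tags("preprocessing")

# Run just the selected nodes against the catalog built by the init script
SequentialRunner().run(partial_pipeline, catalog)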
Upvotes: 4
Reputation: 566
So there is a section of the documentation that deals with Databricks:
https://kedro.readthedocs.io/en/latest/04_user_guide/12_working_with_databricks.html
The easiest way to get started will probably be to sync with git and run via a Databricks notebook. However, as mentioned, there are other approaches that use the ".whl" file and reference the "conf" folder.
Upvotes: 0