dres

Reputation: 1211

Kedro deployment to Databricks

Maybe I misunderstand the purpose of packaging, but it doesn't seem very helpful for creating an artifact for production deployment because it only packages code. It leaves out the conf, data, and other directories that make the Kedro project reproducible.

I understand that I can use the Docker or Airflow plugins for deployment, but what about deploying to Databricks? Do you have any advice here?

I was thinking about making a wheel that could be installed on the cluster, but I would need to package the conf first. Another option is to just sync a Git workspace to the cluster and run Kedro via a notebook.

Any thoughts on a best practice?

Upvotes: 5

Views: 3031

Answers (4)

mediumnok

Reputation: 324

Here is a more modern answer, as of 2024.

In the past there has been some friction, mainly because Kedro is project-based while Databricks focuses heavily on notebooks. This has changed since the introduction of dbx and Databricks Asset Bundles, which let users work in an IDE while sending jobs to Databricks. There is now a kedro-databricks community plugin that helps you get started quickly with Databricks Asset Bundles: https://github.com/JenspederM/kedro-databricks.

On the topic of conf packaging: Kedro follows the "Twelve-Factor App" principle, so configuration is not shipped with the package. That said, you can always move conf into src so it gets packaged; the only thing you need to change is your project's settings.py, where you update CONF_SOURCE to point at the new directory.
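For illustration, a minimal sketch of that settings.py change, where my_package and the new conf location are placeholders for your own project:

# settings.py -- sketch only; assumes conf/ was moved under src/my_package/
# so that it ships inside the built wheel
CONF_SOURCE = "src/my_package/conf"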

The configuration is expected to be modified at deploy time, which is why it is external to the package. You will most likely put it on DBFS or similar remote storage. There is an example in our docs; you can find more detail here: https://docs.kedro.org/en/stable/deployment/databricks/databricks_deployment_workflow.html#upload-project-data-and-configuration-to-dbfs
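As a rough sketch, uploading the conf folder with the Databricks CLI could look like this (the DBFS target path is only an example):

databricks fs cp --recursive conf dbfs:/FileStore/my_project/conf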

Upvotes: 0

dres

Reputation: 1211

I found the best option was to just use another tool for packaging, deploying, and running the job. Using MLflow with Kedro seems like a good fit. I do almost everything in Kedro but use MLflow for the packaging and job execution: https://medium.com/@QuantumBlack/deploying-and-versioning-data-pipelines-at-scale-942b1d81b5f5

The MLproject file looks like this:

name: My Project
conda_env: conda.yaml
entry_points:
  main:
    command: "kedro install && kedro run"

Then running it with:

mlflow run -b databricks -c cluster.json . -P env="staging" --experiment-name /test/exp
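The conda.yaml referenced by conda_env above is the standard MLflow project environment file. A minimal sketch, with placeholder names and versions that you would pin to whatever your project actually uses:

name: my-project-env
channels:
  - defaults
dependencies:
  - python=3.7
  - pip
  - pip:
      - kedro
      - mlflow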

Upvotes: 1

mayurc

Reputation: 265

If you are not using Docker and are just using Kedro to deploy directly on a Databricks cluster, this is how we have been deploying Kedro to Databricks.

  1. The CI/CD pipeline builds the project with kedro package, which creates a wheel file.

  2. Upload dist and conf to DBFS, or copy them to Azure Blob storage (if using Azure Databricks).

This uploads everything to Databricks on every git push.
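A sketch of those two CI steps, assuming the DBFS layout used by the init script below and a BRANCH variable supplied by the CI system:

kedro package   # builds the wheel into dist/
databricks fs cp --recursive dist dbfs:/project_name/build/cicd/$BRANCH/dist
databricks fs cp --recursive conf dbfs:/project_name/build/cicd/$BRANCH/conf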

Then you can have a notebook with the following:

  1. You can have an init script in Databricks, something like:
from cargoai import run                       # cargoai is this project's package
from cargoai.pipeline import create_pipeline

# the notebook widget selects which branch's build to load from DBFS
branch = dbutils.widgets.get("branch")

conf = run.get_config(
    project_path=f"/dbfs/project_name/build/cicd/{branch}"
)
catalog = run.create_catalog(config=conf)
pipeline = create_pipeline()

At this point conf, catalog, and pipeline are available in the notebook.

  2. Call this init script when you want to run a branch, or the master branch in production, like:
    %run "/Projects/InitialSetup/load_pipeline" $branch="master"

  3. For development and testing, you can run specific nodes:
    pipeline = pipeline.only_nodes_with_tags(*tags)

  4. Then run a full or a partial pipeline with just SequentialRunner().run(pipeline, catalog), as sketched below.
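A minimal sketch of those last two steps, assuming conf, catalog, and pipeline come from the init script above (the "training" tag is only an example):

from kedro.runner import SequentialRunner

# run the full pipeline
SequentialRunner().run(pipeline, catalog)

# or, during development, only the nodes carrying a given tag
SequentialRunner().run(pipeline.only_nodes_with_tags("training"), catalog)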

In production, this notebook can be scheduled by Databricks. If you are on Azure Databricks, you can use Azure Data Factory to schedule and run it.

Upvotes: 4

Tom Goldenberg

Reputation: 566

So there is a section of the documentation that deals with Databricks:

https://kedro.readthedocs.io/en/latest/04_user_guide/12_working_with_databricks.html

The easiest way to get started is probably to sync with Git and run via a Databricks notebook. However, as mentioned, there are other approaches using the .whl file and referencing the conf folder.

Upvotes: 0
