Emilie Picard-Cantin

Reputation: 290

How to create a Databricks job using a Python file outside of dbfs?

I am fairly new to Databricks, so forgive me for the lack of knowledge here. I am using the Databricks resource in Azure. I mainly use the UI right now, but I know some features are only available through the databricks-cli, which I have set up but not used yet.

I have cloned my Git repo into Databricks Repos using the UI. Inside my repo, there is a Python file that I would like to run as a job.

Can I use Databricks Jobs to create a job that calls this Python file directly? The only way I have been able to make this work is to create another Python file that calls the file in my Databricks Repo and upload it to dbfs.
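
To illustrate, here is a rough sketch of the kind of wrapper file I mean (the repo path placeholders are the same as below, and it assumes my_python_file.py exposes a main() function):

# wrapper.py -- uploaded to dbfs and used as the job's Python file
import sys

# Make the cloned repo importable (placeholder path; adjust to the real user folder and repo name)
sys.path.append("/Workspace/Repos/<user_folder>/<repo_name>")

# Import and run the code that lives in the Databricks Repo
from my_python_file import main

main()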

Maybe it cannot be done, or maybe the path I use is incorrect. I tried the following path structure when creating a job from a Python file, and it did not work, unfortunately.

file:/Workspace/Repos/<user_folder>/<repo_name>/my_python_file.py

Upvotes: 8

Views: 4953

Answers (4)

Jortega

Reputation: 3790

Here is an example of using the Databricks SDK to run a Python file at a path like the one given above: file:/Workspace/Repos/<user_folder>/<repo_name>/my_python_file.py

Get the latest SDK: https://pypi.org/project/databricks-sdk/

pip install databricks-sdk

Replace the values of the variables below: host, token, python_path, and the cluster name compared against c.cluster_name.


from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs, compute
import time

def main():
    #auth: https://databricks-sdk-py.readthedocs.io/en/latest/authentication.html
    w = WorkspaceClient(
        host="https://...",
        token="YOUR_TOKEN"
    )
    python_path = "/Repos/<user_folder>/<repo_name>/my_python_file.py"
    cluster_id = None
    for c in w.clusters.list():
        if c.cluster_name == "CLUSTER_NAME":
            cluster_id = c.cluster_id
    #Create and run a job: https://databricks-sdk-py.readthedocs.io/en/latest/workspace/jobs/jobs.html
    created_job = w.jobs.create(name=f'sdk-test-{time.time_ns()}',
                                tasks=[
                                    jobs.Task(
                                        description="test-job-desc",
                                        existing_cluster_id=cluster_id,
                                        spark_python_task=jobs.SparkPythonTask(python_file=python_path),
                                        task_key='test-job-key',
                                        timeout_seconds=0,
                                        # Add dependent libraries such as pytest
                                        libraries=[
                                            compute.Library(
                                                pypi=compute.PythonPyPiLibrary(package='pytest')
                                            )
                                        ]
                                    )
                                ])
    run_by_id = w.jobs.run_now(job_id=created_job.job_id).result()
    # # Uncomment the following section to print out details
    # for i in run_by_id.__dict__:
    #     print(i, ":", run_by_id.__dict__[i])

    # cleanup
    w.jobs.delete(job_id=created_job.job_id)


if __name__ == "__main__":
    main()

Upvotes: 0

ARCrow

Reputation: 1857

I resolved this by adding the Databricks notebook header comments to my Python script, so Databricks recognizes it as a Databricks notebook:

# Databricks notebook source

# COMMAND ----------
import pyspark.sql.functions as f

df = spark.createDataFrame([
    (1,2)
], ['test_1', 'test_2'])
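
Once the file is recognized as a notebook, a job can point at it directly with a notebook task. Here is a minimal sketch using the Databricks SDK (the cluster ID and Repos path are placeholders, and it assumes the SDK picks up default authentication, e.g. from environment variables):

from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()  # uses default authentication (environment variables or a config profile)

created_job = w.jobs.create(
    name="repo-notebook-job",
    tasks=[
        jobs.Task(
            task_key="run-repo-notebook",
            existing_cluster_id="<cluster_id>",  # placeholder: ID of an existing cluster
            # Notebook path inside the repo, without the .py extension (placeholder path)
            notebook_task=jobs.NotebookTask(
                notebook_path="/Repos/<user_folder>/<repo_name>/my_python_file"
            ),
        )
    ],
)
w.jobs.run_now(job_id=created_job.job_id)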

Upvotes: 4

Moe

Reputation: 1

1. Install the databricks-cli by typing pip install databricks-cli in the VS Code terminal.

From https://docs.databricks.com/dev-tools/cli/index.html

2. Upload your Python .py file into Azure storage mounted on Databricks (check how to mount Azure storage on Databricks).

3. Connect to Databricks from the CLI by typing the following in the VS Code terminal:

databricks configure --token

It will ask you for the Databricks instance URL and then for a personal access token (you can generate one under Settings in Databricks; check how to generate a token).

4. Create the Databricks job by typing the following in the terminal:

databricks jobs create --json-file create-job.json

Contents of create-job.json

{
  "name": "SparkPi Python job",
  "new_cluster": {
    "spark_version": "7.3.x-scala2.12",
    "node_type_id": "Standard_F4",
    "num_workers": 2
  },
  "spark_python_task": {
    "python_file": "dbfs:/mnt/xxxxxx/raw/databricks-connectivity-test.py",
    "parameters": [
      "10"
    ]
  }
}

I gathered this information from the YouTube video below: https://www.youtube.com/watch?v=XZFN0hOA8mY&ab_channel=JonWood

Upvotes: 0

Zi Dong

Reputation: 61

One workaround is to create a wrapper notebook that calls this file, i.e.

from my_python_file import main
main()

Then you can schedule a job on this notebook.

Upvotes: 6
