Reputation: 290
I am fairly new to Databricks, so forgive me for the lack of knowledge here. I am using the Databricks resource in Azure. I mainly use the UI right now, but I know some features are only available through databricks-cli, which I have set up but not used yet.
I have cloned my Git repo into Databricks Repos using the UI. Inside my repo, there is a Python file that I would like to run as a job.
Can I use Databricks Jobs to create a job that calls this Python file directly? The only way I have been able to make this work is to create and upload to DBFS another Python file that calls the file in my Databricks Repo.
Maybe it cannot be done, or maybe the path I use is incorrect. I tried the following path structure when creating a job from a Python file, and unfortunately it did not work.
file:/Workspace/Repos/<user_folder>/<repo_name>/my_python_file.py
Upvotes: 8
Views: 4953
Reputation: 3790
Here is an example of using the Databricks SDK to run a Python file at a path like the one given above.
file:/Workspace/Repos/<user_folder>/<repo_name>/my_python_file.py
Get the latest SDK from https://pypi.org/project/databricks-sdk/:
pip install databricks-sdk
Replace the text in the variables host, token, python_path and c.cluster_name below.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs, compute
import time


def main():
    # auth: https://databricks-sdk-py.readthedocs.io/en/latest/authentication.html
    w = WorkspaceClient(
        host="https://...",
        token="YOUR_TOKEN"
    )

    python_path = "/Repos/<user_folder>/<repo_name>/my_python_file.py"

    # Look up the id of the existing cluster the job should run on
    cluster_id = None
    for c in w.clusters.list():
        if c.cluster_name == "CLUSTER_NAME":
            cluster_id = c.cluster_id

    # Create and run a job: https://databricks-sdk-py.readthedocs.io/en/latest/workspace/jobs/jobs.html
    created_job = w.jobs.create(
        name=f'sdk-test-{time.time_ns()}',
        tasks=[
            jobs.Task(
                description="test-job-desc",
                existing_cluster_id=cluster_id,
                spark_python_task=jobs.SparkPythonTask(python_file=python_path),
                task_key='test-job-key',
                timeout_seconds=0,
                # Add dependent libraries like pytest
                libraries=[
                    compute.Library(
                        pypi=compute.PythonPyPiLibrary(package='pytest')
                    )
                ]
            )
        ])

    run_by_id = w.jobs.run_now(job_id=created_job.job_id).result()

    # Uncomment the following section to print out run details
    # for i in run_by_id.__dict__:
    #     print(i, ":", run_by_id.__dict__[i])

    # cleanup
    w.jobs.delete(job_id=created_job.job_id)


if __name__ == "__main__":
    main()
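As a follow-up, if you want to see how the run ended before the job is deleted, a few lines like these can go inside main() right before the w.jobs.delete(...) call (a sketch; run_by_id is the finished Run object returned by .result() above):
# Sketch: inspect the finished run before cleaning up the job
print("life cycle state:", run_by_id.state.life_cycle_state)
print("result state:", run_by_id.state.result_state)
print("run page:", run_by_id.run_page_url)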
Upvotes: 0
Reputation: 1857
I resolved this by adding the Databricks notebook header comments to my Python script, so Databricks recognizes it as a Databricks notebook:
# Databricks notebook source
# COMMAND ----------
import pyspark.sql.functions as f
df = spark.createDataFrame([
    (1, 2)
], ['test_1', 'test_2'])
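Once the file has that header, Databricks treats it as a notebook, so a job can reference it with a notebook task pointing at its repo path (without the .py extension). A rough sketch using the Databricks SDK, with the cluster id and paths as placeholders:
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()  # auth from environment variables or a config profile

created = w.jobs.create(
    name="repo-notebook-job",
    tasks=[
        jobs.Task(
            task_key="run-repo-file",
            existing_cluster_id="<cluster-id>",
            # The repo file shows up as a notebook, so the path drops the .py extension
            notebook_task=jobs.NotebookTask(
                notebook_path="/Repos/<user_folder>/<repo_name>/my_python_file"
            ),
        )
    ],
)
print("created job:", created.job_id)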
Upvotes: 4
Reputation: 1
1- Install databricks-cli by typing pip install databricks-cli in the VS Code terminal.
From https://docs.databricks.com/dev-tools/cli/index.html
2- Upload your Python .py file into Azure Storage mounted on Databricks (check how to mount Azure Storage on Databricks).
3- Connect to Databricks from the CLI by typing the following in the VS Code terminal:
databricks configure --token
It will ask you for the Databricks instance URL and then for a personal access token (you can generate one in the Databricks settings; check how to generate a token).
4- Create a Databricks job by typing the following in the terminal: databricks jobs create --json-file create-job.json
Contents of create-job.json:
{
  "name": "SparkPi Python job",
  "new_cluster": {
    "spark_version": "7.3.x-scala2.12",
    "node_type_id": "Standard_F4",
    "num_workers": 2
  },
  "spark_python_task": {
    "python_file": "dbfs:/mnt/xxxxxx/raw/databricks-connectivity-test.py",
    "parameters": [
      "10"
    ]
  }
}
I gathered this information from the YouTube video below: https://www.youtube.com/watch?v=XZFN0hOA8mY&ab_channel=JonWood
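If you also want to trigger the job you just created from the same terminal, the legacy CLI has a run-now command; the job id comes from the output of the create step (the id below is a placeholder):
databricks jobs run-now --job-id 123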
Upvotes: 0
Reputation: 61
One workaround is to create a wrapper notebook that calls this file, i.e.
from my_python_file import main
main()
Then you can schedule a job on this notebook.
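For illustration, the wrapped file just needs to expose a main() function; a hypothetical my_python_file.py could look like this:
# my_python_file.py - hypothetical contents for the wrapper pattern above
from pyspark.sql import SparkSession


def main():
    spark = SparkSession.builder.getOrCreate()
    df = spark.range(10)
    df.show()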
Upvotes: 6