Patterson

Reputation: 2757

Unable to call a function in Apache Spark with Databricks

I have limited knowledge of Python and Python functions, though I believe I have a grasp of the fundamentals. I was provided with a function that I have imported into a module called entity.

When I try to call the function I get the error:

NameError: name 'dbutils' is not defined

The full error is as follows:

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<command-2967368096587460> in <module>
----> 1 entity.rename_file(stagingLocation+"/tempDelta",saveloc+"/past_files","csv","mytest")

/databricks/python/lib/python3.8/site-packages/hydr8/cln/entity.py in rename_file(origin_path, dest_path, file_type, new_name)
    235 
    236 def rename_file(origin_path, dest_path, file_type, new_name):
--> 237   filelist = dbutils.fs.ls(origin_path)#list all files from origin path
    238   filtered_filelist = [x.name for x in filelist if x.name.endswith("."+file_type)]#keep names of the files that match the type requested
    239   if len(filtered_filelist) > 1:#check if we have more than 1 files of that type

NameError: name 'dbutils' is not defined

I'm attempting to call the function in a Databricks notebook with the following code:

entity.rename_file(parameter1, parameter2, parameter3, parameter4)

The function is as follows:

def rename_file(origin_path, dest_path, file_type, new_name):
  filelist = dbutils.fs.ls(origin_path)  # list all files in the origin path
  filtered_filelist = [x.name for x in filelist if x.name.endswith("." + file_type)]  # keep the names of files matching the requested type
  if len(filtered_filelist) > 1:  # check if there is more than one file of that type
    print("Too many " + file_type + " files. You will need a different implementation")
  elif len(filtered_filelist) == 0:  # check if there are no files of that type
    print("No " + file_type + " files found")
  else:
    dbutils.fs.mv(origin_path + "/" + filtered_filelist[0], dest_path + "/" + new_name + "." + file_type)  # move the file to a new path (can be the same), renaming it in the process

The module was built in VSCode and installed as a Python wheel, as shown here:

[screenshot: building the Python wheel in VSCode]

Do I need to define dbutils within VSCode? Because if I run the function directly from Databricks and then call the function as follows:

rename_file(parameter1, parameter2, parameter3, parameter4)

the function runs perfectly fine.

Upvotes: 2

Views: 826

Answers (1)

Hubert Dudek

Reputation: 1722

To develop code in Visual Studio Code you need to use the databricks-connect library, which executes your code on a Spark cluster.

However, it has a number of serious limitations:

  • code is executed on the cluster, not inside the Databricks environment,
  • only some runtime versions are supported,
  • you need the same minor version of Python on your machine as the runtime on the server,
  • several features (like streaming) are not supported (but dbutils, which you mention, is supported).

More information here: https://docs.databricks.com/dev-tools/databricks-connect.html#requirements
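As a side note, a way to avoid the NameError without any extra tooling is to pass the notebook's dbutils handle into the function instead of relying on a global. This is a sketch, assuming entity.py can be edited (the refactored signature below is hypothetical, not the asker's original):

```python
# Hypothetical refactor of entity.py: dbutils is taken as a parameter,
# so the module never depends on the notebook-scoped global.
def rename_file(dbutils, origin_path, dest_path, file_type, new_name):
    filelist = dbutils.fs.ls(origin_path)  # list all files in the origin path
    filtered = [x.name for x in filelist if x.name.endswith("." + file_type)]
    if len(filtered) > 1:  # more than one matching file: refuse to guess
        print("Too many " + file_type + " files. You will need a different implementation")
    elif len(filtered) == 0:  # nothing to rename
        print("No " + file_type + " files found")
    else:
        # move the single match, renaming it in the process
        dbutils.fs.mv(origin_path + "/" + filtered[0],
                      dest_path + "/" + new_name + "." + file_type)
```

In the notebook, the call then becomes `entity.rename_file(dbutils, stagingLocation + "/tempDelta", saveloc + "/past_files", "csv", "mytest")`, handing in the dbutils object that Databricks injects into the notebook scope.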

The community is aware of these limitations, which is why databricks-tunnel is planned for early 2022; it will run your code in the Databricks cloud rather than directly on the cluster. There will be ready-made extensions for PyCharm and VS Code. Below is a picture from last week's roadmap meeting:

[screenshot: slide from the Databricks roadmap meeting]

Upvotes: 1
