smooth_smoothie

Reputation: 1343

Is there another way to use third-party libraries in Unity Catalog (Databricks)?

My team and I are using Unity Catalog in Databricks for ease of data storage and retrieval. So far so good, until I needed to install a library for reading Excel files easily...

I've hit a pretty big roadblock: according to Databricks, you give up the ability to use third-party libraries in UC. Is there any possible workaround? Spark does not have a native ability to open .xlsx files. Disabling UC would be a big setback, as it makes data retrieval/access straightforward for other teams.

I was thinking of running a notebook on a non-UC cluster and somehow passing the results to the UC-enabled notebook, but I don't think that's possible, unless I'm missing something.


EDIT: User-defined functions (UDFs), which are often needed for transformations, and their associated APIs also don't work in UC shared cluster mode.

Upvotes: 4

Views: 3318

Answers (5)

Andrea

Reputation: 12375

You can enable libraries on shared Unity Catalog clusters from the Admin settings panel.


You can also set the same workspace setting using the Workspace Conf (Enable/disable features) endpoint of the Databricks API (more information in the API reference).

This is the endpoint:

/api/2.0/workspace-conf

Basically, you send a PATCH request specifying the property you want to set in the body:

{
    "enableLibraryAndInitScriptOnSharedCluster":"true"
}
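
For example, a minimal sketch of that call using Python's requests library (the workspace URL and personal access token below are placeholders you would substitute with your own):

import requests

# Placeholders - substitute your own workspace URL and personal access token
DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

response = requests.patch(
    f"{DATABRICKS_HOST}/api/2.0/workspace-conf",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"enableLibraryAndInitScriptOnSharedCluster": "true"},
)
response.raise_for_status()  # raises if the setting could not be applied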

Upvotes: 1

oteng

Reputation: 11

I don't know about third-party libraries, but to read Excel with Python on a shared cluster, one workaround would be to copy the file to local storage using dbutils.fs.cp and then use pandas to read the Excel file. You can then convert it to a PySpark DataFrame, if needed. It would be something like the below; you would need Databricks Runtime 13 or above for this to work.

# pandas needs openpyxl installed to read .xlsx files
import pandas

def read_excel(dbfs_location, local_temp_location="file:///tmp/excel_file.xlsx"):
    dbutils.fs.cp(dbfs_location, local_temp_location)   # copy from DBFS to a local temp file on the driver
    temp_location = local_temp_location.replace("file://", "")
    excel_df = pandas.read_excel(temp_location)
    return excel_df                                      # return a pandas DataFrame

excel_df = read_excel("dbfs:/FileStore/tables/test/Book2.xlsx")
display(excel_df)
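
If you need the result as a Spark DataFrame (for example, to write it to a Unity Catalog table), the pandas DataFrame can be converted directly; the table name below is just an illustration:

# convert the pandas DataFrame to a Spark DataFrame
spark_df = spark.createDataFrame(excel_df)

# optionally persist it to a Unity Catalog table (illustrative three-level name)
spark_df.write.mode("overwrite").saveAsTable("my_catalog.my_schema.excel_data")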

Upvotes: 1

736f5f6163636f756e74

Reputation: 81

I'm using AWS, but for what it's worth, I'm contemplating creating a Lambda function that's triggered by the creation of a new EC2 instance. If the instance is tagged with the cluster ID of one of our shared, UC-enabled clusters, it would use boto3 to run my init script.

If you like using shared compute for jobs (we do/did, so that different users can view metrics), you'd have to add a sleep at the beginning of the job, because it would start running before the installations are done on the Azure/AWS side.

I don't know if this would work. I'm just sharing my thoughts so they might give someone else a better idea.
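
For reference, here's a rough, untested sketch of what that Lambda might look like, assuming it's triggered by an EC2 "running" state-change event via EventBridge, the instances are reachable through SSM Run Command, and the cluster IDs, tag key, and commands below are placeholders:

import boto3

# Hypothetical values - adjust for your own environment
SHARED_CLUSTER_IDS = {"0101-123456-abcdefgh"}          # IDs of your shared, UC-enabled clusters
INIT_COMMANDS = ["/bin/bash /tmp/my-init-script.sh"]   # whatever your init script would have run

ec2 = boto3.client("ec2")
ssm = boto3.client("ssm")

def handler(event, context):
    # EC2 state-change events carry the instance id in the "detail" section
    instance_id = event["detail"]["instance-id"]

    # Databricks-managed instances typically carry a ClusterId tag (assumption)
    tags = ec2.describe_tags(
        Filters=[{"Name": "resource-id", "Values": [instance_id]}]
    )["Tags"]
    cluster_id = next((t["Value"] for t in tags if t["Key"] == "ClusterId"), None)

    if cluster_id in SHARED_CLUSTER_IDS:
        # run the installation commands on the new instance via SSM Run Command
        ssm.send_command(
            InstanceIds=[instance_id],
            DocumentName="AWS-RunShellScript",
            Parameters={"commands": INIT_COMMANDS},
        )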

Upvotes: 0

jastyk

Reputation: 11

Our team has the same issue, since with the traditional Hive metastore we were using a global init script to install a bunch of external libraries on all our clusters. However, because shared-mode clusters that support Unity Catalog do not support init scripts, we moved the installation of these libraries to a notebook that is called by other notebooks. In our case we have a notebook called "PythonPackages" with the following code:

try:
    import openpyxl
except ImportError:
    # install the package only if it is not already available on the cluster
    %pip install openpyxl
    import openpyxl

Then we call this notebook at the beginning of every other notebook where we have actual code that we want to run:

%run Shared/PythonPackages

As a result, things work for us as they used to even with Unity Catalog.

Upvotes: 1

FoxHound

Reputation: 323

I'm running into the same thing, as we just started a proof of concept for Unity Catalog. What I have found is that the limitation only applies to installing them on shared clusters as cluster libraries. You can still use third-party libraries as notebook-scoped libraries, i.e.:

%pip install library_of_interest

I have not found a way to install a library for more than one user or for more than one notebook.

Single-user clusters seem to still allow you to install cluster libraries, though!

Upvotes: 3
