Reputation: 1343
My team and I are using Unity Catalog in Databricks for ease of data storage and retrieval. So far so good, until I needed to install a library for reading Excel files easily...
I've hit a pretty big roadblock: according to Databricks, you give up the ability to use third-party libraries in UC. Is there any possible workaround? Spark does not natively support opening .xlsx files. Disabling UC would be a big setback, as it makes data retrieval/access straightforward for other teams.
I was thinking of running a Notebook on a non-UC cluster and somehow passing the results to the UC-enabled Notebook, but I don't think that's possible, unless I'm missing something.
EDIT: User Defined Functions, which are often needed for transformations, and their associated APIs also don't work in UC shared cluster mode.
Upvotes: 4
Views: 3318
Reputation: 12375
You can enable libraries on shared Unity Catalog clusters from the Admin settings panel.
You can also set the same workspace setting using the Workspace Conf (enable/disable features) endpoint of the Databricks API (more information in the API reference here).
This is the endpoint:
/api/2.0/workspace-conf
Basically you have to create a PATCH request specifying the property you want to set in the body:
{
  "enableLibraryAndInitScriptOnSharedCluster": "true"
}
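For example, a minimal sketch of that call in Python (the workspace URL and personal access token below are placeholders; an admin-level token is required):

import requests

WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<admin-personal-access-token>"  # placeholder

# PATCH the workspace-conf endpoint to enable libraries/init scripts on shared clusters
resp = requests.patch(
    f"{WORKSPACE_URL}/api/2.0/workspace-conf",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"enableLibraryAndInitScriptOnSharedCluster": "true"},
)
resp.raise_for_status()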
Upvotes: 1
Reputation: 11
I don't know about third-party libraries in general, but to read Excel with Python on a shared cluster, one workaround is to copy the file to local storage using dbutils.fs.cp and then read it with pandas. You can then convert it to a PySpark DataFrame if needed. It would look something like the below. You need Databricks Runtime 13 or above for this to work.
# Requires openpyxl to be installed for pandas to read .xlsx files
import pandas

def read_excel(dbfs_location, local_temp_location="file:///tmp/excel_file.xlsx"):
    dbutils.fs.cp(dbfs_location, local_temp_location)  # copy to a local temp file
    temp_location = local_temp_location.replace("file://", "")
    excel_df = pandas.read_excel(temp_location)
    return excel_df  # return a pandas DataFrame

excel_df = read_excel("dbfs:/FileStore/tables/test/Book2.xlsx")
display(excel_df)
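If you then need a PySpark DataFrame (as mentioned above), a minimal sketch, assuming the usual spark session available in Databricks notebooks, would be:

spark_df = spark.createDataFrame(excel_df)  # convert the pandas DataFrame to a Spark DataFrame
display(spark_df)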
Upvotes: 1
Reputation: 81
I'm on AWS, but for what it's worth, I'm contemplating creating a Lambda function that's triggered by the creation of a new EC2 instance. If the instance is tagged with the cluster ID of one of our shared, UC-enabled clusters, the Lambda would use boto3 to run my init script.
If you like using shared compute for jobs (we do/did, so different users can view metrics), you'd have to add a sleep at the beginning of the job, because it would start running before the installations are finished on the Azure/AWS side.
I don't know if this would work. I'm just sharing my thoughts in case they give someone else a better idea.
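A very rough, untested sketch of that Lambda (the event shape assumes an EventBridge EC2 state-change rule, the ClusterId tag key matches the default Databricks instance tags, and delivering the init script via SSM Run Command is purely an assumption):

import boto3

UC_CLUSTER_IDS = {"<shared-uc-cluster-id>"}  # placeholder cluster IDs to match against

def lambda_handler(event, context):
    # EventBridge EC2 state-change events carry the instance id in the detail
    instance_id = event["detail"]["instance-id"]

    # Look up the instance's tags and pull out the Databricks ClusterId tag
    ec2 = boto3.client("ec2")
    tags = ec2.describe_tags(
        Filters=[{"Name": "resource-id", "Values": [instance_id]}]
    )["Tags"]
    cluster_id = next((t["Value"] for t in tags if t["Key"] == "ClusterId"), None)

    if cluster_id not in UC_CLUSTER_IDS:
        return  # not one of our shared UC clusters

    # Assumption: the node can accept SSM Run Command and the script is already on it
    ssm = boto3.client("ssm")
    ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": ["bash /tmp/my_init_script.sh"]},  # hypothetical path
    )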
Upvotes: 0
Reputation: 11
Our team has the same issue: with the traditional Hive Metastore we used a global init script to install a bunch of external libraries on all our clusters. However, since shared-mode clusters that support Unity Catalog do not support init scripts, we moved the installation of these libraries to a Notebook that is called by other Notebooks. In our case we have a Notebook called "PythonPackages" with the following code:
try:
    import openpyxl
except:
    %pip install openpyxl
    import openpyxl
Then we call this Notebook at the beginning of each and every other Notebook where we have actual code that we want to run:
%run Shared/PythonPackages
As a result, things work for us as they used to even with Unity Catalog.
Upvotes: 1
Reputation: 323
I'm running into the same thing, as we just started a proof of concept for Unity Catalog. What I have found is that the limitation only applies to installing cluster libraries on shared clusters. You can still use third-party libraries as notebook-scoped libraries, i.e.
%pip install library_of_interest
I have not found a way to install a library for more than one user or for more than one notebook.
Single-user clusters still seem to allow you to install cluster libraries, though!
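If you want to add a cluster library to a single-user cluster programmatically rather than through the UI, a sketch using the Libraries API (workspace URL, token, and cluster ID are placeholders) could look like:

import requests

WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"  # placeholder
CLUSTER_ID = "<single-user-cluster-id>"  # placeholder

# Install a PyPI package as a cluster library on the given cluster
resp = requests.post(
    f"{WORKSPACE_URL}/api/2.0/libraries/install",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"cluster_id": CLUSTER_ID, "libraries": [{"pypi": {"package": "openpyxl"}}]},
)
resp.raise_for_status()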
Upvotes: 3