Reputation: 31
About 6 weeks ago (early April 2022), I had tested a Databricks workflow to ensure that I could trigger jobs on databricks remotely from Airflow, which was successful.
As part of the process the workflow activates a pre-built compute, it then loads various R libraries from DBFS into the compute, one of the packages is 'arrow', however while all the other packages load without issue this package fails to load successfully and then causes my workflow to crash.
When I look into the workflow I get the following error 'DRIVER_LIBRARY_INSTALLATION_FAILURE. Error Message: Command to install library [RCranPkgId(arrow,None,None)] on [0303-130414-840hkwxf] orgId [5132544506122561] failed inside Databricks infrastructure', see image below. Arrow Fail Message" data-fileid="0698Y00000JFZosQAHI have triggered workflow directly inside of databricks and still get the same problem, so clearly it does not have an airflow related cause.
I tried to delete the arrow package from dbfs to see if I could run the test without it, but everytime I delete it, it returns when I retry the workflow.
I then checked CRAN to see if arrow had been updated recently, it was on the 2022-05-09, so I loaded an older version instead, (having first deleted everything relating to it from dbfs), this didn't work either, see images attached.
# Databricks notebook source
.libPaths()
# COMMAND ----------
dir("/databricks/spark/R/lib")
# COMMAND ----------
## Add current working directory to library paths
.libPaths(c(getwd(), .libPaths()))
# COMMAND ----------
## The latest versions from CRAN
install.packages(c('arrow', 'tidyverse', 'aws.s3', 'sparklyr', 'cluster', 'sqldf', 'lubridate', 'ChannelAttribution'), repos = "http://cran.us.r-project.org")
# COMMAND ----------
dir("/tmp/Rserv/conn970")
# COMMAND ----------
## Copy from driver to DBFS
system("cp -R /tmp/Rserv/conn970 /usr/lib/R/site-library")
# COMMAND ----------
dir("/usr/lib/R/site-library")
# COMMAND ----------
## Copy from driver to DBFS
system("cp -R /tmp/Rserv/conn970 /dbfs/r-libraries")
# COMMAND ----------
dir("/dbfs/r-libraries")
# COMMAND ----------
# Add packages to libPaths
.libPaths("/dbfs/r-libraries")
# COMMAND ----------
# Check that the dbfs libraries are in libPath
.libPaths()
I have also attached the script I'm using in R to load the packages to dbfs, which I think is good as every other package load properly, however it may be of use in understanding what I am doing or why the error occurs.
What I'd like to know is:
Do you see anything that I might be doing incorrectly inside my attached libraries script?
Is there an issue with loading the arrow package and if so do you have a work around that can prevent the failure?
Why does dbfs continue to re-install arrow despite my removal of it from the directory?
Can I permanently remove the arrow package from dbfs without it returning everytime I trigger the workflow?
Many thanks in advance for any help you guys can offer.
Upvotes: 0
Views: 639
Reputation: 343
Run a cell like this to install R packages quickly. rspm
pulls binary compiled versions.
%sh
apt install -y apt-file
Rscript -e "install.packages('rspm', repos = 'https://cloud.r-project.org/')"
Rscript -e "rspm::install_sysreqs()"
Rscript -e "rspm::enable(); install.packages('arrow')"
Upvotes: 0
Reputation: 31
I figured it out, the issue had two aspects to it. The first was plain to see once I looked at the json file for the job. In it I specified that certain packages should be loaded when building the compute, this configuration overrode my model's script by virtue of the fact that it ran first. Seeing this in the json file explained why the arrow library kept trying to load regardless of the fact that I had removed it from dbfs.
The second part of the solution was how to get arrow to load without failing, which continued to happened when I retried to reinstall it on dbfs or independently on my model. By default Databricks seems to try to load packages from CRAN (https://cran.r-project.org/), I don't understand why it failed, maybe an issue with ubuntu from the last update???
The solution was to install it from a snapshot I got from MRAN at the following location 'https://cran.microsoft.com/snapshot/2022-02-24/'. Thank you user2554330, your comment got me rethinking and set me on the right direction.
I hope that this helps someone else if they are having similar issues.
Upvotes: 2