Reputation: 107
I cannot install rgdal and rgeos on Databricks, any suggestions?
configure: error: gdal-config not found or not executable.
ERROR: configuration failed for package ‘rgdal’
* removing ‘/databricks/spark/R/lib/rgdal’
configure: error: geos-config not found or not executable.
ERROR: configuration failed for package ‘rgeos’
* removing ‘/databricks/spark/R/lib/rgeos’
Upvotes: 3
Views: 1213
Reputation: 3255
Here is one way to install rgdal and rgeos for R on Azure Databricks. Steps 1 and 2 need to be done each time you start the cluster. Step 1 can be automated (see below), but step 2 has to be run manually, either in a separate script or at the top of your R script.
Step 1: First install gdal and geos on the Linux machines in your cluster. This can be done with a bash script in a Databricks notebook. The %sh magic command lets the cell run a shell script.
%sh
#!/bin/bash
#Start by updating everything
sudo apt-get update
##############
#### rgdal
#This installs gdal on the linux machine but not the R library (done in R script)
#See https://databricks.com/notebooks/rasterframes-notebook.html
sudo apt-get install -y gdal-bin libgdal-dev
#To be able to install the R library, you also need libproj-dev
#See https://philmikejones.me/tutorials/2014-07-14-installing-rgdal-in-r-on-linux/
sudo apt-get install -y libproj-dev
##############
#### rgeos
#This installs geos on the linux machine but not the R library (done in R script)
#See https://philmikejones.me/tutorials/2014-07-14-installing-rgdal-in-r-on-linux/
sudo apt-get install -y libgeos++-dev
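Before moving on, it can be worth confirming that the two tools the R configure step complained about are now on the PATH. The following is a minimal sketch for a Python cell; the helper name `missing_tools` is made up for illustration, and it only checks the machine the cell runs on.

```python
import shutil

def missing_tools(tools=("gdal-config", "geos-config")):
    """Return the names in `tools` that cannot be found on the PATH."""
    # shutil.which returns None when no executable with that name is found
    return [name for name in tools if shutil.which(name) is None]

missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("gdal-config and geos-config are both available")
```

If either name is reported as missing, the rgdal or rgeos install in R will most likely fail with the same configure error as in the question.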
Running that cell manually each time is tedious, so you can instead create an init script that runs every time the cluster starts. In a Databricks Python notebook, copy the code below into a cell. Scripts in dbfs:/databricks/init/<name_of_cluster> run on start-up for clusters with that name.
#This code creates a bash script called install_packages.sh. The cluster runs that script on each startup.
#The content of the bash script is whatever is inside the variable script
clusterName = "RStudioCluster"
script = """#!/bin/bash
#Start by updating everything
sudo apt-get update
##############
#### rgdal
#This installs gdal on the linux machine but not the R library (done in R script)
#See https://databricks.com/notebooks/rasterframes-notebook.html
sudo apt-get install -y gdal-bin libgdal-dev
#To be able to install the R library, you also need libproj-dev
#See https://philmikejones.me/tutorials/2014-07-14-installing-rgdal-in-r-on-linux/
sudo apt-get install -y libproj-dev
##############
#### rgeos
#This installs geos on the linux machine but not the R library (done in R script)
#See https://philmikejones.me/tutorials/2014-07-14-installing-rgdal-in-r-on-linux/
sudo apt-get install -y libgeos++-dev
"""
dbutils.fs.put("dbfs:/databricks/init/%s/install_packages.sh" % clusterName, script, True)
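Since the same apt-get lines now live in two places, one option is to generate the init script from a single package list. This is only a sketch under some assumptions: the helper names `build_init_script` and `init_script_path` are hypothetical, the comments from the script above are dropped for brevity, and the geos package is written as `libgeos++-dev`, which is my understanding of the Debian/Ubuntu package name.

```python
# System packages needed before rgdal and rgeos can be compiled in R
APT_PACKAGES = ["gdal-bin", "libgdal-dev", "libproj-dev", "libgeos++-dev"]

def build_init_script(packages):
    """Return the text of a bash script that installs `packages` with apt-get."""
    lines = [
        "#!/bin/bash",
        "sudo apt-get update",
        "sudo apt-get install -y " + " ".join(packages),
    ]
    return "\n".join(lines) + "\n"

def init_script_path(cluster_name, file_name="install_packages.sh"):
    """Return the DBFS location used above for a cluster-named init script."""
    return "dbfs:/databricks/init/%s/%s" % (cluster_name, file_name)

script = build_init_script(APT_PACKAGES)
path = init_script_path("RStudioCluster")
print(path)
print(script)

# In a Databricks notebook, where `dbutils` is predefined, the script
# would then be written exactly as in the cell above:
# dbutils.fs.put(path, script, True)
```

Printing the path and script first makes it easy to eyeball what will be written before overwriting the existing file.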
Step 2: So far you have only installed gdal and geos on the Linux machines in the cluster. In this step you install the R package rgdal. Recent versions of rgdal, however, are not compatible with the most recent version of gdal available through apt-get. See here for more details and alternative ways to solve this, but if you are OK with an older version of rgdal, the easiest workaround is to install version 1.2-20 of rgdal. You do that in a Databricks R notebook or in the RStudio Databricks app like this:
#install_version() comes from devtools and installs a specific version of a package
library(devtools)
install_version("rgdal", version="1.2-20")
install.packages("rgeos")
Then you can load these libraries as usual:
library(rgdal)
library(rgeos)
Upvotes: 3