user3706794

Reputation: 107

Install rgdal and rgeos on Azure Databricks

I cannot install rgdal and rgeos on Databricks, any suggestions?

configure: error: gdal-config not found or not executable.
ERROR: configuration failed for package ‘rgdal’
* removing ‘/databricks/spark/R/lib/rgdal’

configure: error: geos-config not found or not executable.
ERROR: configuration failed for package ‘rgeos’
* removing ‘/databricks/spark/R/lib/rgeos’

Upvotes: 3

Views: 1213

Answers (1)

TheIceBear

Reputation: 3255

Here is one way to install rgdal and rgeos for R on Azure Databricks. Steps 1 and 2 need to be done each time you start the cluster. Step 1 can be automated (see below), but step 2 has to be run manually, either in a separate script or at the top of your R script.

Step 1

You first need to install GDAL and GEOS on the Linux machines in your cluster. This can be done with a bash script in a Databricks notebook. %sh is the magic command that lets the cell run a shell script. Note that a %sh cell runs on the driver node only.

%sh
#!/bin/bash

#Start by refreshing the package lists
sudo apt-get update

##############
#### rgdal

#This installs gdal on the linux machine but not the R library (done in R script)
#See https://databricks.com/notebooks/rasterframes-notebook.html
sudo apt-get install -y gdal-bin libgdal-dev

#To be able to install the R library, you also need libproj-dev 
#See https://philmikejones.me/tutorials/2014-07-14-installing-rgdal-in-r-on-linux/
sudo apt-get install -y libproj-dev 

##############
#### rgeos

#This installs geos on the linux machine but not the R library (done in R script)
#See https://philmikejones.me/tutorials/2014-07-14-installing-rgdal-in-r-on-linux/
sudo apt-get install -y libgeos++-dev
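
The configure errors in the question come from gdal-config and geos-config not being found, so it can help to confirm they are on the PATH before moving on to the R step. Below is a minimal sketch for a Python notebook cell; the helper name missing_tools is hypothetical, and the check only covers the machine the cell runs on (the driver), not the workers.

```python
import shutil

# Tools that the rgdal and rgeos configure scripts look for
# (names taken from the error messages in the question).
REQUIRED_TOOLS = ["gdal-config", "geos-config"]


def missing_tools(tools):
    """Return the entries of `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools(REQUIRED_TOOLS)
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("gdal-config and geos-config are both available")
```

If either tool is reported as missing, the apt-get step above probably did not complete on that machine.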

Running that manually each time is tedious, so you can instead create an init script that runs every time the cluster starts. In a Databricks Python notebook, copy the following code into a cell. Scripts in dbfs:/databricks/init/<name_of_cluster> run on start-up for clusters with that name.

#This cell creates a bash script called install_packages.sh. The cluster runs this file on each startup.
# The bash script contains whatever is inside the variable script

clusterName = "RStudioCluster"
script = """#!/bin/bash

#Start by refreshing the package lists
sudo apt-get update

##############
#### rgdal

#This installs gdal on the linux machine but not the R library (done in R script)
#See https://databricks.com/notebooks/rasterframes-notebook.html
sudo apt-get install -y gdal-bin libgdal-dev

#To be able to install the R library, you also need libproj-dev 
#See https://philmikejones.me/tutorials/2014-07-14-installing-rgdal-in-r-on-linux/
sudo apt-get install -y libproj-dev 

##############
#### rgeos

#This installs geos on the linux machine but not the R library (done in R script)
#See https://philmikejones.me/tutorials/2014-07-14-installing-rgdal-in-r-on-linux/
sudo apt-get install -y libgeos++-dev

"""
dbutils.fs.put("dbfs:/databricks/init/%s/install_packages.sh" % clusterName, script, True)
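
As a variation on the cell above, the script text can be generated from a list of packages so that the package names live in one place. This is only a sketch under assumptions: build_init_script and APT_PACKAGES are hypothetical names, the GEOS package is spelled libgeos++-dev (the name used in Ubuntu's repositories), and the dbutils call is left commented out because dbutils only exists inside Databricks.

```python
# Hypothetical list of apt packages needed before installing rgdal and rgeos.
APT_PACKAGES = ["gdal-bin", "libgdal-dev", "libproj-dev", "libgeos++-dev"]


def build_init_script(packages):
    """Build the text of a bash init script that installs `packages` with apt-get."""
    lines = [
        "#!/bin/bash",
        "set -e",  # stop at the first failing command
        "sudo apt-get update",
        "sudo apt-get install -y " + " ".join(packages),
    ]
    return "\n".join(lines) + "\n"


script = build_init_script(APT_PACKAGES)
print(script)

# In a Databricks notebook, write it to the same location as in the answer:
# clusterName = "RStudioCluster"
# dbutils.fs.put("dbfs:/databricks/init/%s/install_packages.sh" % clusterName, script, True)
```

The -y flag matters here because an init script runs unattended and cannot answer an apt confirmation prompt.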

Step 2

So far you have only installed GDAL and GEOS on the Linux machines in the cluster. In this step you install the R packages rgdal and rgeos. Recent versions of rgdal, however, are not compatible with the most recent version of GDAL available through apt-get. See here for more details and alternative ways to solve this, but if you are OK with an older version of rgdal, the easiest workaround is to install version 1.2-20 of rgdal. You do that in a Databricks R notebook or in the RStudio Databricks app like this:

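#install_version() is provided by devtools
library(devtools)
#Pin rgdal to the older version that works with the GDAL from apt-get
install_version("rgdal", version="1.2-20")
install.packages("rgeos")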

Setup done

Then you can load these libraries as usual:

library(rgdal)
library(rgeos)

Upvotes: 3
