Pritesh Jain
Reputation: 31

Setup Apache Sedona on EMR

I want to use Apache Sedona for distributed GIS computing on AWS EMR, and I need a bootstrap script that installs all of its dependencies.

I tried setting up GeoSpark on EMR 5.33 with the jars listed here. It didn't work because some dependencies were still missing.

I then set Sedona up manually on my local machine, compared its jars against those of a plain Spark 3 install, and came up with the following bootstrap script:

#!/bin/bash
sudo pip3 install numpy
sudo pip3 install boto3 pandas findspark shapely py4j attrs
sudo pip3 install geospark --no-dependencies
sudo pip3 install apache-sedona
sudo aws s3 cp s3://emr_setup/apache-sedona-1.0.1-incubating-bin/sedona-python-adapter-2.4_2.11-1.0.1-incubating.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/apache-sedona-1.0.1-incubating-bin/sedona-viz-2.4_2.11-1.0.1-incubating.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/geospark_bin/postgresql-42.2.23.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/sedona-core-2.4_2.11-1.0.1-incubating.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/stream-2.7.0.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/orc-core-1.5.5-nohive.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/jersey-media-jaxb-2.22.2.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/hadoop-mapreduce-client-common-2.6.5.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/hadoop-mapreduce-client-shuffle-2.6.5.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/org.w3.xlink-24.0.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/minlog-1.3.0.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/jersey-client-2.22.2.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/xz-1.5.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/pyrolite-4.13.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/hadoop-yarn-common-2.6.5.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/curator-recipes-2.6.0.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/aopalliance-1.0.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/commons-configuration-1.6.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/commons-beanutils-1.7.0.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/gt-metadata-24.0.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/spark-unsafe_2.11-2.4.7.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/objenesis-2.5.1.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/commons-httpclient-3.1.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/stax-api-1.0-2.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/hk2-api-2.4.0-b34.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/apacheds-i18n-2.0.0-M15.jar /usr/lib/spark/jars/
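
The long list of copy commands above could also be written as a loop that stops at the first failed copy, so a missing or mistyped S3 path shows up in the bootstrap log. This is only a minimal sketch under the same bucket layout; the `copy_jars` function and the `COPY_CMD` array are illustrative names, not part of any EMR tooling.

```bash
#!/bin/bash
# Command used for each copy. Kept in an array (an assumption of this sketch)
# so it can be swapped out, e.g. for a dry run.
COPY_CMD=(sudo aws s3 cp)

# Hypothetical helper: copy each named jar from one prefix into a target directory.
# Usage: copy_jars <source-prefix> <destination-dir> <jar> [<jar> ...]
copy_jars() {
  local src_prefix="$1" dest_dir="$2"
  shift 2
  local jar
  for jar in "$@"; do
    if ! "${COPY_CMD[@]}" "${src_prefix}/${jar}" "${dest_dir}/"; then
      echo "failed to copy ${jar} from ${src_prefix}" >&2
      return 1
    fi
  done
}

# Example call, commented out so the sketch has no side effects on its own:
# copy_jars s3://emr_setup/spark_2.4_2.11_sedona_all_jars /usr/lib/spark/jars \
#   sedona-core-2.4_2.11-1.0.1-incubating.jar stream-2.7.0.jar minlog-1.3.0.jar || exit 1
```

Ending the call with `|| exit 1` makes the bootstrap action itself fail when a jar cannot be copied, instead of continuing with an incomplete set.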

The EMR cluster starts, but the notebooks attached to it don't seem to be able to start. The master node seems to fail for some reason.

I need help preparing the right bootstrap script to install Apache Sedona on EMR 6.0.

Upvotes: 3

Views: 1442

Answers (2)

Rakesh B R
Reputation: 1

I had the same issues while setting up Sedona on EMR. The problem is more pronounced on the ARM64/aarch64 architecture (used by Amazon EMR's Graviton instances; I used m6g.xlarge clusters), because many precompiled binaries may not be available, which forces pip to try building from source. What worked for me was changing my cluster to m5g.xlarge, which uses the x86 architecture; after that, all the dependencies installed smoothly. Follow this official setup documentation.
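
To see which architecture a node is actually running before pip installs anything, the bootstrap script could log it. A minimal sketch; the `describe_arch` function name is made up for illustration, and the messages are only reminders, not output from any EMR tool.

```bash
#!/bin/bash
# Hypothetical helper: map a machine string (as printed by `uname -m`) to a short note.
describe_arch() {
  case "$1" in
    aarch64|arm64) echo "arm64: pip may need to build some packages from source" ;;
    x86_64|amd64)  echo "x86_64: prebuilt wheels are more widely available" ;;
    *)             echo "unknown architecture: $1" ;;
  esac
}

# Log the architecture of the current node.
describe_arch "$(uname -m)"
```

The line ends up in the bootstrap action's log, which makes it easier to tell an architecture problem apart from a missing dependency.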

Upvotes: 0

Jia Yu - Apache Sedona
Reputation: 304

Here is a complete tutorial for setting up Sedona on EMR on EC2.

EMR version: 6.9.0.

Installed applications: Hadoop 3.3.3, JupyterEnterpriseGateway 2.6.0, Livy 0.7.1, Spark 3.3.0

I am using it together with EMR Studio (notebooks).

  1. In an S3 bucket, add a script that has the following content:
#!/bin/bash

# EMR clusters only have ephemeral local storage. It does not really matter where we store the jars.
sudo mkdir /jars

# Download Sedona jar
sudo curl -o /jars/sedona-python-adapter-3.0_2.12-1.3.1-incubating.jar "https://repo1.maven.org/maven2/org/apache/sedona/sedona-python-adapter-3.0_2.12/1.3.1-incubating/sedona-python-adapter-3.0_2.12-1.3.1-incubating.jar"

# Download GeoTools jar
sudo curl -o /jars/geotools-wrapper-1.3.0-27.2.jar "https://repo1.maven.org/maven2/org/datasyslab/geotools-wrapper/1.3.0-27.2/geotools-wrapper-1.3.0-27.2.jar"

# Install necessary python libraries
sudo python3 -m pip install pandas geopandas==0.10.2
sudo python3 -m pip install attrs matplotlib descartes apache-sedona==1.3.1

When you create an EMR cluster, specify the location of this script in the bootstrap action.

  2. When you create an EMR cluster, add the following content to the software configuration:
[
  {
    "Classification":"spark-defaults", 
    "Properties":{
      "spark.yarn.dist.jars": "/jars/sedona-python-adapter-3.0_2.12-1.3.1-incubating.jar,/jars/geotools-wrapper-1.3.0-27.2.jar",
      "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
      "spark.kryo.registrator": "org.apache.sedona.core.serde.SedonaKryoRegistrator",
      "spark.sql.extensions": "org.apache.sedona.viz.sql.SedonaVizExtensions,org.apache.sedona.sql.SedonaSqlExtensions"
      }
  }
]
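
For anyone creating the cluster from the command line rather than the console, the two steps above map roughly onto `aws emr create-cluster` options. The sketch below only prints a command for review. The cluster name, instance type, instance count, bucket path, and file names are all placeholders I made up, and `print_create_cluster_cmd` is a hypothetical helper.

```bash
#!/bin/bash
# Hypothetical helper: print (not run) an `aws emr create-cluster` command.
# $1: S3 path of the bootstrap script, $2: local path of the JSON configuration file.
print_create_cluster_cmd() {
  local bootstrap_s3_path="$1" config_file="$2"
  local args=(
    aws emr create-cluster
    --name sedona-cluster                      # placeholder name
    --release-label emr-6.9.0
    --applications Name=Hadoop Name=Spark Name=Livy Name=JupyterEnterpriseGateway
    --bootstrap-actions "Path=${bootstrap_s3_path}"
    --configurations "file://${config_file}"
    --instance-type m5.xlarge                  # placeholder instance type
    --instance-count 3                         # placeholder cluster size
    --use-default-roles
  )
  echo "${args[*]}"
}

# Placeholder bucket and file names; replace with your own before running the printed command.
print_create_cluster_cmd "s3://my-bucket/sedona-bootstrap.sh" "./sedona-config.json"
```

Printing the command first makes it easy to check the bootstrap path and configuration file before anything is launched.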

The key point is to use Sedona 1.3.1-incubating, which can find the jars specified in the spark.yarn.dist.jars property. The spark.jars property is ignored for EMR on EC2 because it uses YARN to deploy jars. See SEDONA-183.
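
One thing to watch in the bootstrap script above: `curl` without `--fail` exits successfully even when the server answers with an HTTP error, so a wrong URL can leave an error page saved under the jar's name. A possible check, sketched here with a made-up `verify_jars` helper, relies on the fact that jar files are zip archives and therefore start with the bytes `PK`.

```bash
#!/bin/bash
# Hypothetical helper: succeed only if every given file starts with the zip
# magic bytes "PK". A missing file or a saved HTML error page fails the check.
verify_jars() {
  local jar
  for jar in "$@"; do
    if [ "$(head -c 2 "$jar" 2>/dev/null)" != "PK" ]; then
      echo "not a valid jar: $jar" >&2
      return 1
    fi
  done
}

# Example call for the end of the bootstrap script, commented out in this sketch:
# verify_jars /jars/sedona-python-adapter-3.0_2.12-1.3.1-incubating.jar \
#   /jars/geotools-wrapper-1.3.0-27.2.jar || exit 1
```

Failing the bootstrap action at this point is usually easier to debug than a notebook kernel that cannot find the Sedona classes later.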

Upvotes: 3