Reputation: 31
I want to be able to use Apache Sedona for distributed GIS computing on AWS EMR. We need the right bootstrap script to have all dependencies.
I tried setting up Geospark using EMR 5.33 using the Jars listed here. It didn't work as some dependencies were still missing.
I then manually set Sedona up on local, found the difference of Jars between Spark 3 and the Sedona setup and came up with following bootstrap script
#!/bin/bash
sudo pip3 install numpy
sudo pip3 install boto3 pandas findspark shapely py4j attrs
sudo pip3 install geospark --no-dependencies
sudo pip3 install apache-sedona
sudo aws s3 cp s3://emr_setup/apache-sedona-1.0.1-incubating-bin/sedona-python-adapter-2.4_2.11-1.0.1-incubating.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/apache-sedona-1.0.1-incubating-bin/sedona-viz-2.4_2.11-1.0.1-incubating.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/geospark_bin/postgresql-42.2.23.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/sedona-core-2.4_2.11-1.0.1-incubating.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/stream-2.7.0.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/orc-core-1.5.5-nohive.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/jersey-media-jaxb-2.22.2.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/hadoop-mapreduce-client-common-2.6.5.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/hadoop-mapreduce-client-shuffle-2.6.5.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/org.w3.xlink-24.0.jar /usr/lib/spark/jars/
sudo aws s3 cp s3:///emr_setup/spark_2.4_2.11_sedona_all_jars/minlog-1.3.0.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/jersey-client-2.22.2.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/xz-1.5.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/pyrolite-4.13.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/hadoop-yarn-common-2.6.5.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/curator-recipes-2.6.0.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/aopalliance-1.0.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/commons-configuration-1.6.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/commons-beanutils-1.7.0.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/gt-metadata-24.0.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/spark-unsafe_2.11-2.4.7.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/objenesis-2.5.1.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/commons-httpclient-3.1.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/stax-api-1.0-2.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/hk2-api-2.4.0-b34.jar /usr/lib/spark/jars/
sudo aws s3 cp s3://emr_setup/spark_2.4_2.11_sedona_all_jars/apacheds-i18n-2.0.0-M15.jar /usr/lib/spark/jars/
The EMR setup starts, but the attached notebooks to the script don't seem to be able to start. The master seems to fail for some reason.
Need help with preparing the right bootstrap script to install Apache Sedona on EMR 6.0.
Upvotes: 3
Views: 1442
Reputation: 1
I had the same issues while setting up sedona on emr. This issue is more pronounced on ARM64/aarch64 architecture (used in Amazon EMR's Graviton instances - I used m6g.xlarge-clusters) because many precompiled binaries may not be available, forcing pip to try building from source. what workde for me is that I changed my cluster to m5g.xlarge which uses x86 architecture and the installation of all the dependencies was smooth. Follow this official setup documentation
Upvotes: 0
Reputation: 304
Here is a complete tutorial of setting up Sedona on EMR EC2.
EMR version: 6.9.0.
Installed applications: Hadoop 3.3.3, JupyterEnterpriseGateway 2.6.0, Livy 0.7.1, Spark 3.3.0
I am using it together EMR Studio (notebooks).
#!/bin/bash
# EMR clusters only have ephemeral local storage. It does not really matter where we store the jars.
sudo mkdir /jars
# Download Sedona jar
sudo curl -o /jars/sedona-python-adapter-3.0_2.12-1.3.1-incubating.jar "https://repo1.maven.org/maven2/org/apache/sedona/sedona-python-adapter-3.0_2.12/1.3.1-incubating/sedona-python-adapter-3.0_2.12-1.3.1-incubating.jar"
# Download GeoTools jar
sudo curl -o /jars/geotools-wrapper-1.3.0-27.2.jar "https://repo1.maven.org/maven2/org/datasyslab/geotools-wrapper/1.3.0-27.2/geotools-wrapper-1.3.0-27.2.jar"
# Install necessary python libraries
sudo python3 -m pip install pandas geopandas==0.10.2
sudo python3 -m pip install attrs matplotlib descartes apache-sedona==1.3.1
When you create a EMR cluster, in the bootstrap action, specify the location of this script.
[
{
"Classification":"spark-defaults",
"Properties":{
"spark.yarn.dist.jars": "/jars/sedona-python-adapter-3.0_2.12-1.3.1-incubating.jar,/jars/geotools-wrapper-1.3.0-27.2.jar",
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.kryo.registrator": "org.apache.sedona.core.serde.SedonaKryoRegistrator",
"spark.sql.extensions": "org.apache.sedona.viz.sql.SedonaVizExtensions,org.apache.sedona.sql.SedonaSqlExtensions"
}
}
]
The key point is to use Sedona 1.3.1-incubating which can search for jars specified in spark.yarn.dist.jars
property. spark.jars
property is ignored for EMR on EC2 since it uses Yarn to deploy jars. See SEDONA-183
Upvotes: 3