Reputation: 545
I am new to AWS SageMaker. I created a notebook instance in a VPC with a private subnet, the default KMS encryption key, root access enabled, and no direct internet access. I attached an IAM policy with full access to SageMaker and S3, as described in the documentation. Now, when one of our data scientists tries to run his code in Jupyter, he gets the error below. I can see the jar files in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker_pyspark/jars/, and I have even put the access key and secret key directly in the code. Is there anything we are doing wrong here?
import os
import boto3
import pyspark
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
import sagemaker
from sagemaker import get_execution_role
import sagemaker_pyspark

role = get_execution_role()

spark = SparkSession.builder \
    .appName("app_name2") \
    .getOrCreate()

sc = pyspark.SparkContext.getOrCreate()
sc.setSystemProperty("com.amazonaws.services.s3.enableV4", "true")

# Configure the s3a filesystem: credentials, regional endpoint, and the S3A implementation class
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", 'access_key')
hadoop_conf.set("fs.s3a.secret.key", 'secret_key')
hadoop_conf.set("fs.s3a.endpoint", "s3.us-east-2.amazonaws.com")
hadoop_conf.set("com.amazonaws.services.s3a.enableV4", "true")
hadoop_conf.set("fs.s3a.connection.ssl.enabled", "false")
hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

df = spark.read.csv("s3a://mybucket/ConsolidatedData/my.csv", header="true")
Py4JJavaError: An error occurred while calling o579.csv.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:709)
Upvotes: 0
Views: 1510
Reputation: 545
The jar files were missing from /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/pyspark/jars; I was looking at /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker_pyspark/jars/. Copying the jars into the first location solved the issue.
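If you would rather not copy jars by hand, an alternative is to put the jars that ship with sagemaker_pyspark on Spark's classpath when you build the session, via sagemaker_pyspark.classpath_jars(). Below is a minimal sketch of that approach; the bucket path and endpoint are the ones from the question, and it assumes the bundled jars include the hadoop-aws classes that provide S3AFileSystem (which matches what copying them fixed here):

from pyspark.sql import SparkSession
import sagemaker_pyspark

# Build a classpath string from the jars bundled with sagemaker_pyspark,
# so the S3AFileSystem class can be found without copying files around.
classpath = ":".join(sagemaker_pyspark.classpath_jars())

spark = (
    SparkSession.builder
    .appName("app_name2")
    .config("spark.driver.extraClassPath", classpath)
    .getOrCreate()
)

# With the jars on the classpath, s3a:// paths resolve; point s3a at the bucket's region.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.endpoint", "s3.us-east-2.amazonaws.com")

df = spark.read.csv("s3a://mybucket/ConsolidatedData/my.csv", header="true")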
Upvotes: 1