Jaishree Mishra

Reputation: 545

AWS Sagemaker Spark S3 access issue

I am new to AWS SageMaker. I created a notebook instance in a VPC with a private subnet, the default KMS encryption key, root access enabled, and no direct internet access. Per the documentation, I attached an IAM policy that grants full access to SageMaker and S3. Now, when one of our data scientists tries to run his code in Jupyter, he gets the error below. I can see the jar files in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker_pyspark/jars/, and I have even set the access key and secret key in the code. Is there anything we are doing wrong here?

import os
import boto3

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

import sagemaker
from sagemaker import get_execution_role
import sagemaker_pyspark
import pyspark

role = get_execution_role()
spark = SparkSession.builder \
            .appName("app_name2") \
            .getOrCreate()

sc = pyspark.SparkContext.getOrCreate()
sc.setSystemProperty("com.amazonaws.services.s3.enableV4", "true")

hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", 'access_key')
hadoop_conf.set("fs.s3a.secret.key", 'secret_key')
hadoop_conf.set("fs.s3a.endpoint", "s3.us-east-2.amazonaws.com")
hadoop_conf.set("com.amazonaws.services.s3a.enableV4", "true")
hadoop_conf.set("fs.s3a.connection.ssl.enabled", "false")
hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

df = spark.read.csv("s3a://mybucket/ConsolidatedData/my.csv", header="true")


Py4JJavaError: An error occurred while calling o579.csv.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:709)

Upvotes: 0

Views: 1510

Answers (1)

Jaishree Mishra

Reputation: 545

The jar files were missing from /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/pyspark/jars; I had been looking in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker_pyspark/jars/. Copying the jars into the first location solved the issue.
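
For reference, one way to avoid copying jars by hand is to put the sagemaker_pyspark jars on the Spark driver classpath when the session is built. This is a minimal sketch using the sagemaker_pyspark classpath_jars() helper; the bucket and file names are placeholders, and the classpath must be set before the SparkContext exists (restart the kernel if one is already running).

import sagemaker_pyspark
from pyspark.sql import SparkSession

# Build a colon-separated classpath from the jars shipped with sagemaker_pyspark
# (these include the AWS / hadoop-aws classes needed to resolve s3a://).
classpath = ":".join(sagemaker_pyspark.classpath_jars())

spark = (
    SparkSession.builder
    .appName("app_name2")
    .config("spark.driver.extraClassPath", classpath)
    .getOrCreate()
)

hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3a.endpoint", "s3.us-east-2.amazonaws.com")

# Placeholder bucket/key; with the jars on the classpath, the
# S3AFileSystem class can now be found.
df = spark.read.csv("s3a://mybucket/ConsolidatedData/my.csv", header="true")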

Upvotes: 1
