technophile

Reputation: 37

Read data from S3 on a local machine using PySpark

from pyspark.sql import SparkSession
import boto3
import os
import pandas as pd

spark = SparkSession.builder.getOrCreate()

hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3a.access.key", "myaccesskey")
hadoop_conf.set("fs.s3a.secret.key", "mysecretkey")
hadoop_conf.set("fs.s3a.endpoint", "s3.amazonaws.com")
hadoop_conf.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider")
hadoop_conf.set("fs.s3a.connection.ssl.enabled", "true")

conn = boto3.resource("s3", region_name="us-east-1")

df = spark.read.csv("s3a://mani-test-1206/test/test.csv", header=True)
df.show()

spark.stop()

When I run the code above, I get the following error: java.io.IOException: From option fs.s3a.aws.credentials.provider java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider not found

Hadoop and AWS jars the program is using:

spark-hadoop-distribution: spark-3.2.0-bin-hadoop3.2

hadoop jars:
hadoop-annotations-3.2.0.jar
hadoop-auth-3.2.0.jar
hadoop-aws-3.2.0.jar
hadoop-client-api-3.3.1.jar
hadoop-client-runtime-3.3.1.jar
hadoop-common-3.2.0.jar
hadoop-hdfs-3.2.0.jar

aws jars:
aws-java-sdk-1.11.624.jar
aws-java-sdk-core-1.11.624.jar
aws-java-sdk-dynamodb-1.11.624.jar
aws-java-sdk-s3-1.11.624.jar

Any help would be highly appreciated. Thanks.

Upvotes: 2

Views: 4667

Answers (2)

lubom

Reputation: 339

I had the same problem. What helped me:

  • update hadoop-aws from 3.2.0 to 3.2.2
  • use "fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider" (it looks like the class name changed)
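
The two changes above can be sketched as Spark configuration entries. This is only a minimal sketch, not a verified fix for every setup: the helper names `s3a_spark_conf` and `build_session` are hypothetical, the `spark.hadoop.` prefix is used to pass the settings through to Hadoop, and the hadoop-aws version is assumed to have to match the Hadoop build that Spark is actually running with.

```python
def s3a_spark_conf(access_key, secret_key, hadoop_aws_version="3.2.2"):
    """Build Spark settings for S3A with static credentials (hypothetical helper)."""
    return {
        # Assumption: fetching hadoop-aws through spark.jars.packages; the
        # version should match the Hadoop jars already on the classpath.
        "spark.jars.packages": f"org.apache.hadoop:hadoop-aws:{hadoop_aws_version}",
        # SimpleAWSCredentialsProvider reads fs.s3a.access.key / fs.s3a.secret.key.
        "spark.hadoop.fs.s3a.aws.credentials.provider":
            "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider",
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
    }


def build_session(conf):
    """Create a SparkSession from a dict of settings (hypothetical helper)."""
    # Imported here so the config helper can be used without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Usage, with placeholder credentials and the bucket path from the question:
# spark = build_session(s3a_spark_conf("myaccesskey", "mysecretkey"))
# spark.read.csv("s3a://mani-test-1206/test/test.csv", header=True).show()
```

Setting the values on the builder, before the session is created, avoids going through the private `_jsc` handle, but either approach should end up setting the same Hadoop options.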

Upvotes: 7

BMW

Reputation: 45333

First, you have not properly attached an instance profile (one type of IAM role) to the EC2 instance where you run the code, so it does not have permission to access the nominated S3 bucket.

Second, check that the Java library is the latest version and supports getting AWS credentials from an instance profile.
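
If the code does run on an EC2 instance with an instance profile attached, a configuration along these lines might let S3A pick up the role's credentials instead of static keys. It is a rough sketch under assumptions: `use_instance_profile` is a hypothetical helper, and `com.amazonaws.auth.InstanceProfileCredentialsProvider` is the AWS SDK for Java v1 class, so it needs a matching AWS SDK jar on the classpath. It will not work on a local machine outside EC2.

```python
# AWS SDK for Java v1 provider that reads credentials from the EC2 metadata service.
INSTANCE_PROFILE_PROVIDER = "com.amazonaws.auth.InstanceProfileCredentialsProvider"

# Static-key settings that are not needed when the instance profile is used.
STATIC_KEY_SETTINGS = (
    "spark.hadoop.fs.s3a.access.key",
    "spark.hadoop.fs.s3a.secret.key",
)


def use_instance_profile(conf):
    """Return a copy of conf that relies on the instance profile (hypothetical helper)."""
    # Drop any static keys so they are not left lying around in the config.
    cleaned = {k: v for k, v in conf.items() if k not in STATIC_KEY_SETTINGS}
    cleaned["spark.hadoop.fs.s3a.aws.credentials.provider"] = INSTANCE_PROFILE_PROVIDER
    return cleaned


# Usage with placeholder values:
# conf = use_instance_profile({"spark.hadoop.fs.s3a.access.key": "myaccesskey"})
# Each entry can then be passed to SparkSession.builder.config(key, value).
```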

Upvotes: 0
