Reputation: 11928
There are many similar questions on SO, but I simply cannot get this to work. I'm obviously missing something.
I'm trying to load a simple test CSV file from my S3 bucket.
Doing it locally, like below, works.
from pyspark.sql import SparkSession
from pyspark import SparkContext as sc
logFile = "sparkexamplefile.csv"
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
logData = spark.read.text(logFile).cache()
numAs = logData.filter(logData.value.contains('a')).count()
numBs = logData.filter(logData.value.contains('b')).count()
print("Lines with a: %i, lines with b: %i" % (numAs, numBs))
But if I add this below:
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "foo")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "bar")
lines = sc.textFile("s3n:///mybucket-sparkexample/sparkexamplefile.csv")
lines.count()
I get:
No FileSystem for scheme: s3n
I've also tried changing sc to spark.sparkContext without any difference, and swapping // and /// in the URL.
Even better, I'd rather skip all that and go straight to a DataFrame:
dataFrame = spark.read.csv("s3n:///mybucket-sparkexample/sparkexamplefile.csv")
I'm also somewhat new to AWS, so I have tried s3, s3n, and s3a, all to no avail.
I've been around the internet and back but can't seem to resolve the scheme error. Thanks!
Upvotes: 3
Views: 8348
Reputation: 1000
First, check which hadoop-* jar files are bundled with the version of pyspark installed on your system: look in the pyspark/jars folder for files named hadoop-*.
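For example, here is a quick way to list the bundled hadoop jars (a minimal sketch; the jars path is an assumption that holds for a typical pip-installed pyspark):
import glob
import os
import pyspark

# The jars directory ships inside the pyspark package itself
jars_dir = os.path.join(os.path.dirname(pyspark.__file__), "jars")
for jar in sorted(glob.glob(os.path.join(jars_dir, "hadoop-*.jar"))):
    print(jar)  # e.g. hadoop-common-2.7.3.jar means you want hadoop-aws:2.7.3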
Pass the version you observe there into your pyspark script like this:
import os

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk-pom:1.11.538,org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'
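Putting it together, here is a minimal sketch (not a definitive fix; the bucket name and foo/bar credentials are the placeholders from the question, and the key point is that PYSPARK_SUBMIT_ARGS must be set before the SparkSession is created):
import os

# Must be set before the SparkSession (and its JVM) starts;
# the versions should match your bundled hadoop-* jars
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk-pom:1.11.538,org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SimpleApp").getOrCreate()

# Same s3n credential keys as in the question; "foo"/"bar" are placeholders
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.s3n.awsAccessKeyId", "foo")
hconf.set("fs.s3n.awsSecretAccessKey", "bar")

# Note: two slashes, then the bucket name
dataFrame = spark.read.csv("s3n://mybucket-sparkexample/sparkexamplefile.csv")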
This is a bit tricky for newcomers to pyspark (I ran into it on my very first day with pyspark :-)).
For what it's worth, I'm on a Gentoo system with a local Spark 2.4.2. Some have suggested also installing Hadoop and copying its jars directly into Spark; they still need to be the same version that PySpark uses. So I'm creating a Gentoo ebuild for these versions...
Upvotes: 2
Reputation: 3696
I think your Spark environment didn't pick up the AWS jars. You need to add them in order to use s3 or s3n.
You have to copy the required jar files from a Hadoop download into the $SPARK_HOME/jars directory. Using the --jars or --packages flag with spark-submit didn't work for me.
Here my Spark version is 2.3.0 and Hadoop is 2.7.6, so you have to copy these jars from (hadoop dir)/share/hadoop/tools/lib/ to $SPARK_HOME/jars:
aws-java-sdk-1.7.4.jar
hadoop-aws-2.7.6.jar
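With those jars in place, the read from the question should go straight to a DataFrame. A short sketch, reusing the bucket name and placeholder credentials from the question:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SimpleApp").getOrCreate()

hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.s3n.awsAccessKeyId", "foo")        # placeholder
hconf.set("fs.s3n.awsSecretAccessKey", "bar")    # placeholder

dataFrame = spark.read.csv("s3n://mybucket-sparkexample/sparkexamplefile.csv")
dataFrame.show(5)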
Upvotes: 5