I have a bucket with a few small Parquet files that I would like to consolidate into a bigger one.
To do this, I would like to create a Spark job that consumes them and writes a new, single file.
from pyspark import SparkContext
from pyspark.sql import SparkSession, SQLContext

spark = SparkSession.builder \
    .master("local") \
    .appName("Consolidated tables") \
    .getOrCreate()

spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", "access")
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "secret")

df = spark.read.parquet("s3://lake/bronze/appx/contextb/*")
This code throws the exception No FileSystem for scheme: s3. If I switch to s3a://..., I get the error Class org.apache.hadoop.fs.s3a.S3AFileSystem not found.
I'm running this code with python myfile.py.
Any idea what's wrong?
Upvotes: 0
Views: 7074
Reputation: 520
from boto3.session import Session
from pyspark import SparkContext
from pyspark.sql import SparkSession, SQLContext

spark = SparkSession.builder \
    .master("local") \
    .appName("Consolidated tables") \
    .getOrCreate()

ACCESS_KEY = 'your_access_key'
SECRET_KEY = 'your_secret_key'

# Create a boto3 session with the AWS credentials
session = Session(aws_access_key_id=ACCESS_KEY,
                  aws_secret_access_key=SECRET_KEY)
s3 = session.resource('s3')

df = spark.read.parquet("s3://lake/bronze/appx/contextb/*")
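Note that the boto3 session above is not visible to Spark; for spark.read.parquet to reach the bucket, the credentials also have to be handed to Hadoop's s3a connector, and the hadoop-aws jar must be on the classpath (see the answer below). A minimal sketch, reusing the same ACCESS_KEY/SECRET_KEY variables:

# Sketch: pass the same credentials to Hadoop's s3a connector so that
# Spark itself (not just boto3) can read from the bucket. Assumes the
# hadoop-aws jar is already on the classpath.
hadoop_conf = spark._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", ACCESS_KEY)
hadoop_conf.set("fs.s3a.secret.key", SECRET_KEY)
hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

# Use the s3a:// scheme, which is the scheme the S3A filesystem registers.
df = spark.read.parquet("s3a://lake/bronze/appx/contextb/*")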
Upvotes: -1
Reputation: 1932
Download hadoop-aws-2.7.5.jar (or the latest version) and make the jar available to Spark:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .config("spark.jars", "/path/to/hadoop-aws-2.7.5.jar") \
    .getOrCreate()
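Alternatively, the dependency can be resolved from Maven with spark.jars.packages instead of pointing at a local jar. A sketch of the full consolidation job under that assumption; the 2.7.5 version must match the Hadoop build your Spark distribution ships with, and the output path contextb_consolidated is a hypothetical name for illustration:

from pyspark.sql import SparkSession

# Sketch: pull hadoop-aws (and its transitive AWS SDK dependency) from Maven
# instead of a local jar. The 2.7.5 version is an assumption; it must match
# the Hadoop version bundled with your Spark distribution.
spark = SparkSession.builder \
    .master("local") \
    .appName("Consolidated tables") \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.5") \
    .getOrCreate()

spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", "access")
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "secret")

# Read the small files with the s3a:// scheme and rewrite them as one file;
# coalesce(1) forces a single output partition, hence a single Parquet file.
# The destination path below is a hypothetical example.
df = spark.read.parquet("s3a://lake/bronze/appx/contextb/*")
df.coalesce(1).write.mode("overwrite").parquet("s3a://lake/bronze/appx/contextb_consolidated/")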
Upvotes: 2