DanG

Reputation: 741

Connect AWS S3 to Databricks PySpark

I'm trying to connect to an S3 bucket and read all my CSV files with Databricks PySpark. When I use a bucket that I have admin access to, it works without error:

data_path = 's3://mydata_path_with_adminaccess/'

But when I try to connect to a bucket that requires an ACCESS_KEY_ID and SECRET_ACCESS_KEY, it does not work and access is denied.

I tried:

data_path = 's3://mydata_path_without_adminaccess/'

AWS_ACCESS_KEY_ID='my key'
AWS_SECRET_ACCESS_KEY='my key'

and:

data_path = 's3://<MY_ACCESS_KEY_ID>:<MY_SECRET_ACCESS_KEY>@mydata_path_without_adminaccess'

Upvotes: 2

Views: 4942

Answers (2)

raam

Reputation: 41

To connect S3 to Databricks with an access key, you can simply mount the S3 bucket on Databricks. This creates a pointer to your S3 bucket in DBFS. If you already have the credentials stored as Databricks secrets, retrieve them as below:

access_key = dbutils.secrets.get(scope = "aws", key = "aws-access-key")
secret_key = dbutils.secrets.get(scope = "aws", key = "aws-secret-key")

If you do not have a secret stored in Databricks, use the hard-coded keys below instead to avoid the "Secret does not exist with scope" error:

access_key = "your-access-key"
secret_key = "your-secret-key"

# URL-encode the secret key, since it can contain "/" characters
encoded_secret_key = secret_key.replace("/", "%2F")

# Mount the bucket on DBFS and list its contents
aws_bucket_name = "s3-bucket-name"
mount_name = "mount-name"
dbutils.fs.mount("s3a://%s:%s@%s" % (access_key, encoded_secret_key, aws_bucket_name), "/mnt/%s" % mount_name)
display(dbutils.fs.ls("/mnt/%s" % mount_name))

Access your S3 data as below:

mount_name = "mount-name"
file_name = "file-name"
df = spark.read.text("/mnt/%s/%s" % (mount_name, file_name))
df.show()
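
Since the question is about reading CSV files, here is a minimal sketch that reads every CSV under the mount in one go; the mount name is a placeholder and the header/inferSchema options are assumptions about your files:

# Read all CSV files under the mount point into a single DataFrame
# (mount name is a placeholder; header/inferSchema are assumptions about the data)
mount_name = "mount-name"
csv_df = spark.read.option("header", "true").option("inferSchema", "true").csv("/mnt/%s/" % mount_name)
csv_df.show()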

Upvotes: 2

Mohit Verma

Reputation: 5296

I am not sure whether you have tried mounting your bucket in Databricks using the secret and access keys, but it's worth trying.

Here is the code for the same:

ACCESS_KEY = dbutils.secrets.get(scope = "aws", key = "aws-access-key")
SECRET_KEY = dbutils.secrets.get(scope = "aws", key = "aws-secret-key")
ENCODED_SECRET_KEY = SECRET_KEY.replace("/", "%2F")
AWS_BUCKET_NAME = "<aws-bucket-name>"
MOUNT_NAME = "<mount-name>"

dbutils.fs.mount("s3a://%s:%s@%s" % (ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME), "/mnt/%s" % MOUNT_NAME)
display(dbutils.fs.ls("/mnt/%s" % MOUNT_NAME))

Then you can access files in your S3 bucket as if they were local files:

df = spark.read.text("/mnt/%s/...." % MOUNT_NAME)
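
One caveat: if the mount point already exists, dbutils.fs.mount raises an error, so you may want to unmount first. A small sketch under that assumption, with the mount name as a placeholder:

# Unmount first if the mount point already exists, then re-run the mount
MOUNT_NAME = "<mount-name>"
if any(m.mountPoint == "/mnt/%s" % MOUNT_NAME for m in dbutils.fs.mounts()):
    dbutils.fs.unmount("/mnt/%s" % MOUNT_NAME)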

Additional reference:

https://docs.databricks.com/data/data-sources/aws/amazon-s3.html

Hope it helps.

Upvotes: 4
