abu
abu

Reputation: 147

How to connect Superset with AWS athena?

Has anyone tried connecting superset to AWS athena ?

I was able to connect to redshift by using SQLAlchemy URI: postgresql://username:[email protected]:port/dbname

but I am having hard time connecting to AWS athena. AWS has JDBC driver (http://docs.aws.amazon.com/athena/latest/ug/connect-with-jdbc.html) but I can't figure out how to use it with superset. Any example ?

Upvotes: 6

Views: 13634

Answers (7)

Pratik Goenka
Pratik Goenka

Reputation: 2379

After lot of hustle managed to create connection string which works, note that all key & s3 path needs to be encoded, below format works for me

awsathena+rest://{encoded aws_access_key_id}:{encoded aws_secret_access_key}@athena.{region_name}.amazonaws.com:443/{schema_name}?s3_staging_dir={encoded s3_staging_dir}

You can use below code to generate connection string, save it to file & run

from urllib.parse import quote_plus

conn_str = "awsathena+rest://{aws_access_key_id}:{aws_secret_access_key}@athena.{region_name}.amazonaws.com:443/"\
       "{schema_name}?s3_staging_dir={s3_staging_dir}"
               
print(conn_str.format(
      aws_access_key_id=quote_plus("{aws_access_key_id}"),
      aws_secret_access_key=quote_plus("{aws_secret_access_key}"),
      region_name="{region_name}",
      schema_name="{schema_name}",
      s3_staging_dir=quote_plus("{s3_staging_dir}")))

Upvotes: 0

Official guidance from Superset:

https://superset.apache.org/docs/databases/athena

awsathena+rest://{aws_access_key_id}:{aws_secret_access_key}@athena.{region_name}.amazonaws.com/{schema_name}?s3_staging_dir={s3_staging_dir}&...

You need to have some tweaks by yourself. This worked for me after many hours of reading posts on 2021-12-12:

awsathena+rest://{secret id}:{secret access key}@athena.ap-southeast-1.amazonaws.com/test?s3_staging_dir=s3://{your bucket where Athena query result is stored}/test/&work_group=primary

Note that in my example:

"schema_name = test": You must see a database named "test" under Athena \ Query Editor \ Database at this point. It is created in Glue Console \ Data Catalog \ Database with a crawler or manual.

s3://{your bucket}/{path if needed}/test: you need to go to Athena \ Workgroups, select a workgroup and check the setting if it turned on the "Query result location" or not. In my case, the name of the workgroup is "primary", the query result of the "test" database will be stored in s3://{your bucket where Athena query result is stored}/test/

Make sure you have installed these under Python Virtual Environment:

pip install "PyAthenaJDBC>1.0.9"

pip install "PyAthena>1.2.0"

See how to create Superset under Python Env:

https://superset.apache.org/docs/installation/installing-superset-from-scratch

Security Group:

(I got this instruction from here: https://www.youtube.com/watch?v=vzuPQPRcT-0)

I build Superset on the EC2 Instance. Therefore, you need to check out the security group setting. Because it relates to EC2 service, Athena service, and the website which Superset is running in UI.

In my case, I have turned on all these settings to make sure it can run the first time. Then you can narrow down the setting later.

Custom TCP - TCP - 8088 - ::/0 ; 0.0.0.0/0

HTTP - TCP - 80 - ::/0 ; 0.0.0.0/0

SSH - TCP - 22 - ::/0 ; 0.0.0.0/0

Custom ICMP - IPv4 - Echo Request - N/A - 0.0.0.0/0

All ICMP - IPv6 - IPv6 ICMP - All - ::/0

All ICMP - IPv6 - IPv6 ICMP - All - 0.0.0.0/0

Upvotes: 0

sairahul099
sairahul099

Reputation: 39

We tried installing superset with PyAthena JDBC & REST. Our experience with PyAthena (REST) is far better than PyAthenaJDBC, would recommend to use same in production.

Install PyAthena (pure python library, java is not needed)

pip install "PyAthena>1.2.0"

Access database by creating connection url

awsathena+rest://{aws_access_key_id}:{aws_secret_access_key}@athena.{region_name}.amazonaws.com/{schema_name}?s3_staging_dir={s3_staging_dir}&...

I found this article, a good guide on deploying superset.

Upvotes: 3

Mods Vs Rockers
Mods Vs Rockers

Reputation: 1101

I got it to work using: PyAthenaJDBC (python 3.6.7) with these steps:

1) Make sure you have the PyAthenaJDBC pkg. installed:

pip install "PyAthenaJDBC>1.0.9"

2) Restart superset

3) Download the JDBC driver: from aws driver download I used the AthenaJDBC41-2.0.6.jar version

4) Add the data-source to superset:

awsathena+jdbc://AWS_KEY:[email protected]/mydb?s3_staging_dir=s3://path/to/my/data/&driver_path=/drivers/AthenaJDBC41_2.0.6.jar
Note: If superset is running on ECS / EC2 you can assign an IAM role, and remove the AWS KEY/SECRET from the URI, Example raw connection URI below:
awsathena+jdbc://{aws_key}:{aws_secret}@athena.{region_name}.amazonaws.com/{schema_name}?s3_staging_dir={s3_staging_dir}&driver_path={driver_path}
Much more info here:

Upvotes: 0

JavaNoScript
JavaNoScript

Reputation: 2413

In case someone else would come here:

awsathena+jdbc://username:[email protected]:port/dbname

This is from the superset documentation.

Upvotes: 7

Anxo P
Anxo P

Reputation: 759

You must define a property s3_staging_dir when you connect to Athena's Driver.

Example: s3_staging_dir=s3://your_bucket

Upvotes: 0

BMac
BMac

Reputation: 303

Take a look at this github PR You'll want to install PyAthenaJDBC package into pip. The driver that you are referring to is a Java driver, which is great, but Superset is largely a Python application, so it'll need a python driver to connect/interact with Athena.

The above answer is correct, but you'll want to install that package to ensure that you actually can connect to athena.

Upvotes: 1

Related Questions