PKV

Reputation: 177

Unable to read data from AWS Glue Database/Tables using Python

My requirement is to use a Python script to read data from an AWS Glue database into a dataframe. While researching, I found the library "awswrangler". I'm using the code below to connect and read the data:

import awswrangler as wr
import boto3

profile_name = 'aws_profile_dev'
REGION = 'us-east-1'

# Retrieving credentials to connect to AWS (get_profile_credentials is a helper defined elsewhere)
ACCESS_KEY_ID, SECRET_ACCESS_KEY, SESSION_TOKEN = get_profile_credentials(profile_name)

session = boto3.session.Session(
    aws_access_key_id=ACCESS_KEY_ID,
    aws_secret_access_key=SECRET_ACCESS_KEY,
    aws_session_token=SESSION_TOKEN
)

my_df = wr.athena.read_sql_table(table='mytable_1', database='shared_db', boto3_session=session)

However, when I run the above code, I get the following error: "ValueError: year 0 is out of range".
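Since the error looks like a timestamp-parsing problem, I'm also looking at the table schema through the Glue catalog to see which columns are typed as timestamps (a small sketch using awswrangler's catalog helper; I haven't confirmed this points at the cause):

# List column names and types for the Glue table
table_schema = wr.catalog.table(database='shared_db', table='mytable_1', boto3_session=session)
print(table_schema)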


Alternatively, I tried using another library - "pyathena". The code I'm trying to use is:

from pyathena import connect
import pandas as pd

conn = connect(aws_access_key_id=ACCESS_KEY_ID,
               aws_secret_access_key=SECRET_ACCESS_KEY,
               aws_session_token=SESSION_TOKEN,
               s3_staging_dir='s3://my-sample-bucket/',
               region_name='us-east-1')
df = pd.read_sql("select * from AwsDataCatalog.shared_db.mytable_1 limit 1000", conn)

Using this, I'm able to retrieve data, but it works only when I use a limit. That is, if I run the query without a limit, i.e. "select * from AwsDataCatalog.shared_db.mytable_1", it gives the error: ValueError: year 0 is out of range.

Weird behavior - for example, if I run:

df = pd.read_sql("select * from AwsDataCatalog.shared_db.mytable_1 limit 1200", conn)

sometimes it gives the same error, but if I reduce the limit value (for example to limit 1199), run that, and then run it again with limit 1200, it works. However, this stops working once I try to read more than ~1300 rows. The table has 2002 rows in total, and I need to read the entire table.
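One workaround I'm experimenting with is casting the timestamp columns to varchar in the query itself, so pandas never has to parse them. The column names below (id, event_time) are just placeholders for the actual columns in my table, and I'm not sure this addresses the root cause:

df = pd.read_sql(
    "select id, cast(event_time as varchar) as event_time "
    "from AwsDataCatalog.shared_db.mytable_1",
    conn
)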

Please help! Thank you!

Upvotes: 1

Views: 4309

Answers (2)

PKV

Reputation: 177

I have found a way to use awswrangler to query data directly from Athena into a pandas dataframe on your local machine. This doesn't require providing an output location on S3.

import awswrangler as wr
import boto3

profile_name = 'Dev-AWS'
REGION = 'us-east-1'

# This automatically retrieves credentials from your AWS credentials file
# (after you run `aws configure` on the command line); get_profile_credentials
# is the same helper used in the question.
ACCESS_KEY_ID, SECRET_ACCESS_KEY, SESSION_TOKEN = get_profile_credentials(profile_name)

session = boto3.session.Session(
    aws_access_key_id=ACCESS_KEY_ID,
    aws_secret_access_key=SECRET_ACCESS_KEY,
    aws_session_token=SESSION_TOKEN
)

df = wr.athena.read_sql_query("select * from table_name", database="db_name", boto3_session=session)

Alternatively, if you don't want to write the SQL yourself but just want to read the entire Glue table, you can use:

my_df = wr.athena.read_sql_table(table='my_table', database='my_db', boto3_session=session)
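Both calls also accept a ctas_approach flag. Setting it to False makes awswrangler run a plain Athena query instead of the default CTAS-based approach; I haven't verified whether it affects the timestamp error, but it is an easy variant to try:

df = wr.athena.read_sql_query(
    "select * from table_name",
    database="db_name",
    ctas_approach=False,
    boto3_session=session
)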

Upvotes: 0

Animesh Kansal

Reputation: 113

Use the following code in Python to get the data you are looking for.

    import boto3

    query = "SELECT * from table_name"
    s3_resource = boto3.resource("s3")
    s3_client = boto3.client('s3')
    DATABASE = 'database_name'
    output = 's3://output-bucket/output-folder'

    athena_client = boto3.client('athena')

    # Start the query; Athena writes the results to the S3 output location
    response = athena_client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={
            'Database': DATABASE
        },
        ResultConfiguration={
            'OutputLocation': output,
        }
    )

    queryId = response['QueryExecutionId']
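start_query_execution only submits the query; it does not return any rows. Below is a minimal sketch of one way to wait for it to finish and load the result into a pandas DataFrame, assuming the query succeeds and relying on Athena writing the result to <OutputLocation>/<QueryExecutionId>.csv:

    import time
    import pandas as pd

    # Poll until the query reaches a terminal state
    while True:
        execution = athena_client.get_query_execution(QueryExecutionId=queryId)
        state = execution['QueryExecution']['Status']['State']
        if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
            break
        time.sleep(1)

    if state == 'SUCCEEDED':
        # Athena stores the result as a CSV named after the query execution id
        obj = s3_client.get_object(
            Bucket='output-bucket',
            Key='output-folder/{}.csv'.format(queryId)
        )
        df = pd.read_csv(obj['Body'])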

Upvotes: 2
