Reputation: 181
We are exploring Aurora Serverless as the database storage for one of our projects. While doing a POC to fetch events from an Aurora Serverless cluster, we are receiving a "Database returned more than the allowed response size limit" exception.
More details about the database: Aurora PostgreSQL Serverless.
For the POC we are using the Data API, which has these limitations.
import boto3

client = boto3.client('rds-data')

sql_statement = "select * from table_name limit 1000"

response = client.execute_statement(
    database='mydb',
    secretArn='<secret_arn>',
    resourceArn='<resource_arn>',
    sql=sql_statement
)
print(response)
Upvotes: 18
Views: 9839
Reputation: 1
I want to share another example of using pandas' read_sql, as mentioned by jameshgrn. I have tried to make it shorter and more readable. It also allows more flexibility in the query by counting the rows instead of using the table status command. However, this method may be slower if the table has a large number of rows, because it requires a count(*) query.
import pandas as pd
# engine is a SQLAlchemy engine created with create_engine(), as in the other answers

chunksize = 10000
with engine.connect() as conn:
    result = conn.execute("select count(*) from your_tbl").fetchone()
    row_num = result[0]
    chunks = [
        pd.read_sql("select * from your_tbl limit {} offset {}".format(chunksize, i), con=conn)
        for i in range(0, row_num, chunksize)
    ]
data_df = pd.concat(chunks, ignore_index=True)
Upvotes: 0
Reputation: 570
When you run a query against any Aurora Serverless instance through the Data API, you are subject to two size limitations; the one behind this error is the 1 MB cap on the result set. Currently, there is no way in the Data API to overcome these limits. The Amazon Aurora User Guide recommends a workaround for the result-set limit (p. 164):
In this case, the size of the result set returned by the database was too large. The Data API limit is 1 MB in the result set returned by the database. To solve this issue, make sure that calls to the Data API return 1 MB of data or less. If you need to return more than 1 MB, you can use multiple ExecuteStatement calls with the LIMIT clause in your query.
Given that, you could do an exponential backoff until you find an acceptable LIMIT for your result set, or set a fixed LIMIT that you are confident will always keep the response below 1 MB, even if your row size grows in the future.
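For example, a rough sketch of that backoff idea, assuming the boto3 rds-data client and the placeholder ARNs/table from the question; the starting LIMIT of 1000 and the check on the error message are illustrative choices, not an official pattern:
import boto3

client = boto3.client('rds-data')

limit = 1000  # starting guess for the page size
while limit > 0:
    try:
        response = client.execute_statement(
            database='mydb',
            secretArn='<secret_arn>',
            resourceArn='<resource_arn>',
            sql='select * from table_name limit {}'.format(limit)
        )
        break  # this LIMIT keeps the response under the 1 MB cap
    except client.exceptions.BadRequestException as error:
        if 'allowed response size limit' not in str(error):
            raise  # a different problem; don't keep shrinking
        limit //= 2  # back off and retry with a smaller page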
After defining how to set your LIMIT clause value (depending on whether your cluster uses MySQL 5.6, MySQL 5.7 or PostgreSQL 10.7), you could run a COUNT query to know how many rows you will get, and then iterate with OFFSET until you have executed COUNT / LIMIT statements.
Another option would be to iterate until your statement response has fewer rows than your LIMIT.
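A minimal sketch of that second option, again assuming the boto3 rds-data client and the question's placeholders; the page size of 1000 and the id column used for a stable ORDER BY are assumptions for illustration:
import boto3

client = boto3.client('rds-data')

page_size = 1000  # must be small enough for one page to stay under 1 MB
offset = 0
records = []
while True:
    page = client.execute_statement(
        database='mydb',
        secretArn='<secret_arn>',
        resourceArn='<resource_arn>',
        sql='select * from table_name order by id limit {} offset {}'.format(page_size, offset)
    )
    rows = page.get('records', [])
    records.extend(rows)
    if len(rows) < page_size:  # fewer rows than the LIMIT: last page reached
        break
    offset += page_size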
Upvotes: 15
Reputation: 745
https://github.com/cloud-utils/aurora-data-api handles this problem
A quick example of a "get all" query using the library:
import aurora_data_api

with aurora_data_api.connect(aurora_cluster_arn=cluster_arn, secret_arn=secret_arn, database="cool_db_name_here") as conn:
    with conn.cursor() as cursor:
        cursor.execute("select * from cool_table")
        data = cursor.fetchall()
Upvotes: 0
Reputation: 26
For some reason, pandas read_sql's "chunksize" argument doesn't play well with the AWS Data API. I am using MySQL, by the way, but pydataapi should support both MySQL and PostgreSQL.
The solution I have found for this issue is to iteratively query the database using pandas' read_sql with string formatted LIMITs. This uses the pydataapi python package.
I first got the length of the table using this syntax:
df_status = pd.read_sql('show table status like "yourtable";', con = sql_engine)
Then use numpy.arange() to determine the starting row for each request. Each query asks for the number of rows designated by the chunksize parameter. I've chosen 20,000, but you should choose the highest value that doesn't lead to a response larger than 1 MB.
Then it's just a matter of concatenating the list of dataframes. I've found this to be the best solution for now.
import pandas as pd
import numpy as np
import boto3
from sqlalchemy import create_engine

cluster_arn = 'yourcluster'
secret_arn = 'yoursecret'
database = 'mydb'
chunksize = 20000  # largest value that keeps each response under 1 MB

rdsData = boto3.client('rds-data', region_name='yourregion')
sql_engine = create_engine('mysql+pydataapi://',
                           connect_args={
                               'resource_arn': cluster_arn,
                               'secret_arn': secret_arn,
                               'database': database,
                               'client': rdsData}).connect()

# Get the row count from table status (approximate for InnoDB tables)
df_status = pd.read_sql("show table status like 'yourtable';", con=sql_engine)
rownum = df_status.loc[0, "Rows"]

# Starting offsets for each chunk
space = np.arange(0, rownum, chunksize)
space = space.tolist()
space.append(rownum)

# Fetch each chunk with MySQL's "LIMIT offset, count" syntax and concatenate
df_list = []
for i in space:
    df = pd.read_sql("select * from yourtable LIMIT {}, {}".format(i, chunksize), con=sql_engine)
    df_list.append(df)
big_df = pd.concat(df_list)
sql_engine.close()
Upvotes: 1