Reputation: 1458
I have a Lake Formation resource link database table, shared from another AWS account, which I can query in Athena just fine with the granted permissions. But I cannot query the same data in EMR: the permissions do not seem to get passed down into PySpark for some reason. I even added my EMR service and instance IAM roles as Lake Formation administrators just to bypass any Lake Formation permissions I might be missing.
The resource link points to an Iceberg table; I'm not sure if that changes things. This is my current Spark configuration.
{
  "Classification": "spark-defaults",
  "Properties": {
    "spark.sql.catalog.aws_glue": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.aws_glue.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
    "spark.sql.catalog.aws_glue.glue.lakeformation.enabled": "true",
    "spark.sql.catalog.aws_glue.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    "spark.sql.catalog.aws_glue.lakeformation-enabled": "true",
    "spark.sql.defaultCatalog": "aws_glue"
  }
}
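For completeness, the classification above can be sanity-checked locally before cluster launch (a minimal sketch, pure Python, no AWS calls; it just confirms the catalog-scoped keys match the default catalog name):

```python
import json

# The spark-defaults classification from above, parsed and checked locally.
classification = json.loads("""
{
  "Classification": "spark-defaults",
  "Properties": {
    "spark.sql.catalog.aws_glue": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.aws_glue.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
    "spark.sql.catalog.aws_glue.glue.lakeformation.enabled": "true",
    "spark.sql.catalog.aws_glue.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    "spark.sql.catalog.aws_glue.lakeformation-enabled": "true",
    "spark.sql.defaultCatalog": "aws_glue"
  }
}
""")

props = classification["Properties"]
catalog = props["spark.sql.defaultCatalog"]

# Every per-catalog property should be scoped to the default catalog name,
# otherwise Spark silently ignores it.
assert all(
    key.startswith(f"spark.sql.catalog.{catalog}")
    for key in props
    if key.startswith("spark.sql.catalog.")
), "catalog-scoped keys do not match the default catalog"
print("classification OK for catalog:", catalog)
```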
If I list the tables for my catalog
# List tables first to verify access
logger.info("Verifying table access...")
tables = spark_session.sql(f"SHOW TABLES FROM {catalog_name}.{db_name}").collect()
logger.info(f"Available tables: {[t.tableName for t in tables]}")
I can see my tables in the logs
Verifying table access...
2025-01-16 05:42:06 INFO Available tables: ['account', 'activitydefinition',...
I have tried a couple of things.
# First try a simple query to verify access
logger.info("\nAttempting simple query")
try:
    count_df = spark_session.sql(f"""
        SELECT *
        FROM {catalog_name}.{db_name}.{table_name}
    """)
    count_df.show()
except Exception as e:
    logger.error(f"Simple query failed: {str(e)}")
# Try reading with minimal options
logger.info("\nAttempting main query")
try:
    df = (
        spark_session.read.format("iceberg")
        .option("lakeformation-enabled", "true")
        .option("read-identity-based-auth", "true")
        .table(f"{db_name}.{table_name}")
        .select("id", "identifier")
    )
    logger.info("Successfully created DataFrame")
    df.printSchema()
    return df
except Exception as e:
    logger.error(f"Main query failed: {str(e)}")
# One final attempt with SQL
logger.info("\nTrying final SQL approach")
df = spark_session.sql(f"""
    SELECT t.*
    FROM {catalog_name}.{db_name}.{table_name} t
""")
return df
But it is always the same error.
Failed to query data: An error occurred while calling o149.sql. :
software.amazon.awssdk.services.s3.model.S3Exception: Access Denied (Service: S3, Status Code: 403, Request ID: FRKVTCMCWA771WS7, Extended Request ID: rH0oJbyJm6IBsmZCMDlOZzbjh5hxBE5oU31zXxnxolomK4a+c4txq7iTV4I7WDsgC32qXMnEAUw=)
    at software.amazon.awssdk.core.internal.http.CombinedResponseHandler.handleErrorResponse(CombinedResponseHandler.java:125)
    at software.amazon.awssdk.core.internal.http.CombinedResponseHandler.handleResponse(CombinedResponseHandler.java:82)
    at software.amazon.awssdk.core.internal.http.CombinedResponseHandler.handle(CombinedResponseHandler.java:60)
    at software.amazon.awssdk.core.internal.http.CombinedResponseHandler.handle(CombinedResponseHandler.java:41)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.HandleResponseStage.execute(HandleResponseStage.java:50)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.HandleResponseStage.execute(HandleResponseStage.java:38)
    at ...
Upvotes: 0
Views: 60
Reputation: 1
Check whether both the EMR service role and the EC2 instance profile role have permission to access the S3 bucket storing the data. This might help.
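To expand on that, here is a minimal sketch of the S3 policy both roles would need. The bucket name and prefix are hypothetical placeholders; substitute the Iceberg table's actual S3 location, and note that in a cross-account setup the bucket policy in the owning account must also allow the access:

```python
import json

# Hypothetical bucket and prefix: replace with the table's real S3 location.
BUCKET = "source-account-data-bucket"
PREFIX = "warehouse/mydb/mytable"

# Minimal read-only policy: list the table prefix, read its data/metadata files.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ListTablePrefix",
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": [f"arn:aws:s3:::{BUCKET}"],
            "Condition": {"StringLike": {"s3:prefix": [f"{PREFIX}/*"]}},
        },
        {
            "Sid": "ReadTableObjects",
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": [f"arn:aws:s3:::{BUCKET}/{PREFIX}/*"],
        },
    ],
}
print(json.dumps(policy, indent=2))
```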
Upvotes: 0