Reputation: 1458
I have a Lake Formation resource link database table, shared from another AWS account, which I can query in Athena just fine with the granted permissions. But I cannot query the same data in EMR: the permissions do not seem to get passed down into PySpark for some reason. I even added my EMR service and instance IAM roles as Lake Formation administrators just to bypass any Lake Formation permissions I might be missing.
The resource link points to an Iceberg table; I'm not sure if that changes things. This is my current Spark configuration.
{
  "Classification": "spark-defaults",
  "Properties": {
    "spark.sql.catalog.aws_glue": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.aws_glue.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
    "spark.sql.catalog.aws_glue.glue.lakeformation.enabled": "true",
    "spark.sql.catalog.aws_glue.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    "spark.sql.catalog.aws_glue.lakeformation-enabled": "true",
    "spark.sql.defaultCatalog": "aws_glue"
  }
}
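For completeness, the classification above can be sanity-checked locally before cluster launch (a minimal sketch, pure Python, no AWS calls; it just confirms the catalog-scoped keys match the default catalog name):

```python
import json

# The spark-defaults classification from above, parsed and checked locally.
classification = json.loads("""
{
  "Classification": "spark-defaults",
  "Properties": {
    "spark.sql.catalog.aws_glue": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.aws_glue.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
    "spark.sql.catalog.aws_glue.glue.lakeformation.enabled": "true",
    "spark.sql.catalog.aws_glue.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    "spark.sql.catalog.aws_glue.lakeformation-enabled": "true",
    "spark.sql.defaultCatalog": "aws_glue"
  }
}
""")

props = classification["Properties"]
catalog = props["spark.sql.defaultCatalog"]

# Every per-catalog property should be scoped to the default catalog name,
# otherwise Spark silently ignores it.
assert all(
    key.startswith(f"spark.sql.catalog.{catalog}")
    for key in props
    if key.startswith("spark.sql.catalog.")
), "catalog-scoped keys do not match the default catalog"
print("classification OK for catalog:", catalog)
```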
If I list the tables for my catalog
# List tables first to verify access
logger.info("Verifying table access...")
tables = spark_session.sql(f"SHOW TABLES FROM {catalog_name}.{db_name}").collect()
logger.info(f"Available tables: {[t.tableName for t in tables]}")
I can see my tables in the logs
Verifying table access...
2025-01-16 05:42:06 INFO Available tables: ['account', 'activitydefinition',...
I have tried a couple of things.
# First try a simple query to verify access
logger.info("\nAttempting simple query")
try:
    count_df = spark_session.sql(f"""
        SELECT *
        FROM {catalog_name}.{db_name}.{table_name}
    """)
    count_df.show()
except Exception as e:
    logger.error(f"Simple query failed: {str(e)}")
# Try reading with minimal options
logger.info("\nAttempting main query")
try:
    df = (
        spark_session.read.format("iceberg")
        .option("lakeformation-enabled", "true")
        .option("read-identity-based-auth", "true")
        .table(f"{db_name}.{table_name}")
        .select("id", "identifier")
    )
    logger.info("Successfully created DataFrame")
    df.printSchema()
    return df
except Exception as e:
    logger.error(f"Main query failed: {str(e)}")
# One final attempt with SQL
logger.info("\nTrying final SQL approach")
df = spark_session.sql(f"""
    SELECT t.*
    FROM {catalog_name}.{db_name}.{table_name} t
""")
return df
But it is always the same error.
Failed to query data: An error occurred while calling o149.sql. :
software.amazon.awssdk.services.s3.model.S3Exception: Access Denied (Service: S3, Status Code: 403, Request ID: FRKVTCMCWA771WS7, Extended Request ID: rH0oJbyJm6IBsmZCMDlOZzbjh5hxBE5oU31zXxnxolomK4a+c4txq7iTV4I7WDsgC32qXMnEAUw=)
    at software.amazon.awssdk.core.internal.http.CombinedResponseHandler.handleErrorResponse(CombinedResponseHandler.java:125)
    at software.amazon.awssdk.core.internal.http.CombinedResponseHandler.handleResponse(CombinedResponseHandler.java:82)
    at software.amazon.awssdk.core.internal.http.CombinedResponseHandler.handle(CombinedResponseHandler.java:60)
    at software.amazon.awssdk.core.internal.http.CombinedResponseHandler.handle(CombinedResponseHandler.java:41)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.HandleResponseStage.execute(HandleResponseStage.java:50)
    at software.amazon.awssdk.core.internal.http.pipeline.stages.HandleResponseStage.execute(HandleResponseStage.java:38)
    at ...
Upvotes: 0
Views: 60
Reputation: 1
Check whether both the EMR service role and the EC2 instance profile role have permission to access the S3 bucket storing the data. This might help.
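To expand on that, here is a minimal sketch of the S3 policy both roles would need. The bucket name and prefix are hypothetical placeholders; substitute the Iceberg table's actual S3 location, and note that in a cross-account setup the bucket policy in the owning account must also allow the access:

```python
import json

# Hypothetical bucket and prefix: replace with the table's real S3 location.
BUCKET = "source-account-data-bucket"
PREFIX = "warehouse/mydb/mytable"

# Minimal read-only policy: list the table prefix, read its data/metadata files.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ListTablePrefix",
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": [f"arn:aws:s3:::{BUCKET}"],
            "Condition": {"StringLike": {"s3:prefix": [f"{PREFIX}/*"]}},
        },
        {
            "Sid": "ReadTableObjects",
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": [f"arn:aws:s3:::{BUCKET}/{PREFIX}/*"],
        },
    ],
}
print(json.dumps(policy, indent=2))
```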
Upvotes: 0