Haha
Haha

Reputation: 1009

How to get rows that failed CustomSql data quality check in AWS Glue

According to this documentation page, AWS Glue can now detect rows that failed a CustomSql data quality check.

I tried it and I am not seeing the rows that failed but only a % of failed data. Here is my code:

from pyspark.sql import SparkSession
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsgluedq.transforms import EvaluateDataQuality
from pyspark.sql import SparkSession, Row
from pyspark.sql.dataframe import DataFrame


spark = SparkSession.builder.getOrCreate() 
spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')

glue_ctx = GlueContext(spark.sparkContext)

df = spark.createDataFrame([Row(campagne='dzdzd', num=5),Row(campagne='["campagne1", "campagne2"]', num=5),Row(campagne='["campagne1", "campagne2"]', num=1)])
dynamic_df = DynamicFrame.fromDF(df, glue_ctx)

EvaluateDataQuality.apply(
        frame=dynamic_df,
        ruleset="""
    Rules = [
        CustomSql "select campagne from primary where num = 5"
    ]
    """,
        publishing_options={}).toDF().show(truncate=False)

This returning the following DataFrame:

+------------------------------------------------------+-------+---------------------------------------------------+------------------------------------------------------+------------------------------------------------------+
|Rule                                                  |Outcome|FailureReason                                      |EvaluatedMetrics                                      |EvaluatedRule                                         |
+------------------------------------------------------+-------+---------------------------------------------------+------------------------------------------------------+------------------------------------------------------+
|CustomSql "select campagne from primary where num = 5"|Failed |Custom SQL response failed to satisfy the threshold|{Dataset.*.CustomSQL.Compliance -> 0.6666666666666666}|CustomSql "select campagne from primary where num = 5"|
+------------------------------------------------------+-------+---------------------------------------------------+------------------------------------------------------+------------------------------------------------------+

Obviously, I would like to have the rows that failed and only a %.

Upvotes: 0

Views: 108

Answers (1)

Haha
Haha

Reputation: 1009

For future reference:

This is the answer for my question from AWS documentation:

EvaluateDataQualityMultiframe = EvaluateDataQuality().process_rows(
    frame= dynamic_df,
    ruleset="""
    Rules = [
        CustomSql "select campagne from primary where num = 5"
    ]
    """,,
    publishing_options={
        "dataQualityEvaluationContext": "EvaluateDataQualityMultiframe",
        "enableDataQualityCloudWatchMetrics": False,
        "enableDataQualityResultsPublishing": False,
    },
    additional_options={"performanceTuning.caching": "CACHE_NOTHING"},
)    

RowLevelOutcomes = SelectFromCollection.apply(
dfc=EvaluateDataQualityMultiframe,
key="rowLevelOutcomes",
transformation_ctx="rowLevelOutcomes",
).toDF().show(truncate=False)

Upvotes: 0

Related Questions