Reputation: 654
I'm preparing a Data Quality Report based on a couple of Contour analyses and would like to take daily snapshots of the reported incorrect records. I then want to show these daily numbers as another report in the same dashboard to track progress on data quality.
The main questions for me are:
Upvotes: 2
Views: 917
Reputation: 16856
Here's one process for setting up daily snapshots of a dataset derived from a Contour analysis:
Ensure that the Contour analysis results are saved as a dataset. Let's call this dataset mydataset.
Create a Python Transform that performs daily snapshots and stores them in a dataset named mydataset_daily_snapshots:
from transforms.api import transform_df, incremental, Input, Output
from pyspark.sql import functions as F


# Run incrementally so that each build appends to the output; a plain transform_df
# build would replace the previous day's snapshot. snapshot_inputs is used on the
# assumption that mydataset is fully rewritten on every build.
@incremental(snapshot_inputs=["my_input"])
@transform_df(
    Output("/output/path/for/mydataset_daily_snapshots"),
    my_input=Input("/path/to/mydataset"),
)
def compute(my_input):
    # 'asof_timestamp' records when this snapshot of the row was taken
    out_df = my_input.withColumn('asof_timestamp', F.current_timestamp())
    # Optional: create a primary key for this row, in case you want to create
    # an Ontology object later on for use in Workshop
    out_df = out_df.withColumn('primary_key', F.concat_ws('-', 'id', 'asof_timestamp'))
    return out_df
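For history to build up, each build has to add rows to the snapshot dataset rather than replace them. The plain-Python sketch below mimics that behaviour without any Foundry or Spark dependency. The names `take_snapshot` and `history` and the sample rows are hypothetical; rows are assumed to carry an `id` column, as in the transform.

```python
from datetime import datetime, timezone


def take_snapshot(rows, asof):
    """Return copies of `rows` stamped with the snapshot time and a derived key."""
    stamped = []
    for row in rows:
        out = dict(row)  # copy so the source rows are not mutated
        out["asof_timestamp"] = asof.isoformat()
        # Same idea as concat_ws('-', 'id', 'asof_timestamp') in the transform
        out["primary_key"] = f"{row['id']}-{out['asof_timestamp']}"
        stamped.append(out)
    return stamped


# Hypothetical incorrect records reported on two consecutive days
day1_rows = [{"id": 1, "issue": "missing name"}, {"id": 2, "issue": "bad date"}]
day2_rows = [{"id": 2, "issue": "bad date"}]

history = []  # stands in for mydataset_daily_snapshots
history += take_snapshot(day1_rows, datetime(2024, 1, 1, tzinfo=timezone.utc))
history += take_snapshot(day2_rows, datetime(2024, 1, 2, tzinfo=timezone.utc))

print(len(history))  # 3 rows: two from day 1, one from day 2
```

Record 2 appears once per day it was reported, and the timestamp in the key keeps those two rows distinct.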
Create Build Schedules on both mydataset and mydataset_daily_snapshots that build the datasets daily (or as frequently as desired), so that mydataset_daily_snapshots will have data snapshots for each day. Ensure you check Force build so that snapshots will always be built, even if the source data has not changed.
You can then use the mydataset_daily_snapshots dataset within another Contour analysis to show the changes in the data over time in a Report, or create an Ontology object from it and use Workshop to show the change over time.
Keep in mind that this dataset can grow very large very quickly. Any filtering that keeps it smaller is a good idea, e.g. limiting snapshots to just the incorrect records, or storing only a daily total of incorrect records.
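As a rough illustration of the size-reduction idea, the sketch below collapses each day's snapshot to a single summary row (date and count of incorrect records) and then derives the day-over-day change that a progress chart could plot. It is plain Python rather than Foundry code; `summarise_by_day`, `with_daily_change`, the `snapshot_date` field, and the sample rows are assumptions made for the example.

```python
from collections import Counter


def summarise_by_day(snapshot_rows):
    """Collapse row-level snapshots into one (date, count) summary per day."""
    counts = Counter(row["snapshot_date"] for row in snapshot_rows)
    # ISO-formatted dates sort chronologically as plain strings
    return [{"snapshot_date": d, "incorrect_count": counts[d]} for d in sorted(counts)]


def with_daily_change(summary):
    """Add the change in incorrect_count versus the previous day (None for the first day)."""
    out = []
    previous = None
    for entry in summary:
        change = None if previous is None else entry["incorrect_count"] - previous
        out.append({**entry, "change": change})
        previous = entry["incorrect_count"]
    return out


# Hypothetical row-level snapshots of incorrect records over three days
snapshot_rows = [
    {"id": 1, "snapshot_date": "2024-01-01"},
    {"id": 2, "snapshot_date": "2024-01-01"},
    {"id": 3, "snapshot_date": "2024-01-01"},
    {"id": 2, "snapshot_date": "2024-01-02"},
    {"id": 3, "snapshot_date": "2024-01-02"},
    {"id": 3, "snapshot_date": "2024-01-03"},
]

trend = with_daily_change(summarise_by_day(snapshot_rows))
for entry in trend:
    print(entry)
```

Six snapshot rows reduce to three summary rows here. Note that a day with no incorrect records produces no row in this sketch, so a real report would need to decide how to represent such days.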
Upvotes: 0