Reputation: 11
I have been tasked with performing Data Quality checks on data from a SQL table, whereby I export problem rows into a separate SQL table.
So far I've used a main Conditional Split that feeds into derived columns, one per conditional split condition. It works by checking for errors and, depending on which condition fails first, outputting the row with a DQ_TYPE column populated with a certain code (e.g. DQ_001 if it had an error with the Hours condition, DQ_002 if it hit an error with the Consultant Code condition, and so on).
The problem is that I need to be able to see all of the errors within each row. For example at the moment, if Patient 101 has a row in the SQL table that has errors in all 5 columns, it'll fail the first condition in Conditional Split and 1 row will get output into my results with the code DQ_001. I would instead need it to be output 5 times, once for each error that it encountered, i.e. 1 row with DQ_001, a 2nd row with DQ_002, a 3rd row with DQ_003 and so on.
The goal is that I will use the DataQualityErrors SQL table to create an SSRS report that groups on DQ_TYPE and we can therefore Pie Chart to show the distribution of which error DQ_00X codes are most prevalent.
Is this possible using straightforward toolbox functions? Or is this only available with complex Script tasks, etc.?
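To make the desired behaviour concrete, here is a minimal Python sketch (not SSIS) of the fan-out I'm after: one output row per failed check, rather than stopping at the first failure. The column names and checks are illustrative, not my real rules.

```python
# Illustrative checks only: each returns True when the row passes.
CHECKS = [
    ("DQ_001", lambda row: row["Hours"] is not None),                 # Hours present
    ("DQ_002", lambda row: row["ConsultantCode"] not in (None, "")),  # code filled in
]

def fan_out(row):
    """Return one (DQ_TYPE, row) pair for every check the row fails."""
    return [(code, row) for code, check in CHECKS if not check(row)]
```

A row failing both checks would come out twice, once tagged DQ_001 and once tagged DQ_002, which is what I want in the DataQualityErrors table.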
Upvotes: 1
Views: 96
Reputation: 61249
Assuming I understand your problem, I would structure this as a series of columns added to the data flow via Derived Column transformation.
Assume I have inbound like this
SELECT col1, col2, col3, col4;
My business rules: col1 must not be NULL, col2 must be greater than 5, and col3 must be less than 3.
From my source, I would add a Derived Column Component
New Column named Pass_DQ_001
as a boolean with the expression !ISNULL([col1])
New Column named Pass_DQ_002
as a boolean with an expression [col2] > 5
New Column named Pass_DQ_003
as a boolean with an expression [col3] < 3
etc
At this point, your data row could look something like
NULL, 4, 4, No Rules, False, False, False
ABC, 7, 2, Still No Rules, True, True, True
...
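In Python terms, a minimal sketch of what that Derived Column component is doing (same illustrative rules as above):

```python
def add_flags(row):
    """Mimic the Derived Column component: one pass/fail flag per rule."""
    row["Pass_DQ_001"] = row["col1"] is not None                       # !ISNULL(col1)
    row["Pass_DQ_002"] = row["col2"] is not None and row["col2"] > 5   # col2 > 5
    row["Pass_DQ_003"] = row["col3"] is not None and row["col3"] < 3   # col3 < 3
    return row
```

Running the two sample rows through it reproduces the flag values shown above.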
If you have more than 3 to 4 data quality conditions, I'd add a final Derived Column component into the mix
New column IsValid
as yet another boolean with an expression like Pass_DQ_001 && Pass_DQ_002 && Pass_DQ_003
etc
The penalty for adding extra columns, especially bit columns, is trivial compared to the cost of debugging one complex expression in a data flow, so keep each rule in its own column.
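The final column just ANDs the individual flags; sketched the same way in Python:

```python
def add_is_valid(row):
    # Mirrors the SSIS expression Pass_DQ_001 && Pass_DQ_002 && Pass_DQ_003
    row["IsValid"] = (row["Pass_DQ_001"]
                      and row["Pass_DQ_002"]
                      and row["Pass_DQ_003"])
    return row
```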
At this point, you can put a data viewer in there and verify that all the logic is correct. If something is wrong, you can zip in and figure out why, say, Pass_DQ_036 isn't flagging correctly.
Otherwise, you're ready to connect the data flow to a Conditional Split. Use the final IsValid column as the split condition: rows where IsValid is true go out the Output 1 path, and the default/unmatched rows head to your "needs attention/failed validation" destination.
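The Conditional Split then only has to partition on that one flag; a minimal sketch of the routing:

```python
def split_on_is_valid(rows):
    """Mimic the Conditional Split: IsValid rows to Output 1,
    everything else to the failed-validation destination."""
    output_1 = [r for r in rows if r["IsValid"]]
    failed = [r for r in rows if not r["IsValid"]]
    return output_1, failed
```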
Upvotes: 1