Andrew Lilley
Andrew Lilley

Reputation: 11

SSIS: flagging ALL the Data Quality issues in each row with Conditional Split

I have been tasked with performing Data Quality checks on data from a SQL table, whereby I export problem rows into a separate SQL table.

So far I've used a main Conditional Split that goes into derived columns: 1 per conditional split condition. It is working whereby it checks for errors, and depending on which condition is failed first, the data is output with a DQ_TYPE column populated with a certain code (e.g. DQ_001 if it had an error with the Hours condition, DQ_002 if it hit an error with the Consultant Code condition, and so on).

The problem is that I need to be able to see all of the errors within each row. For example at the moment, if Patient 101 has a row in the SQL table that has errors in all 5 columns, it'll fail the first condition in Conditional Split and 1 row will get output into my results with the code DQ_001. I would instead need it to be output 5 times, once for each error that it encountered, i.e. 1 row with DQ_001, a 2nd row with DQ_002, a 3rd row with DQ_003 and so on.

The goal is that I will use the DataQualityErrors SQL table to create an SSRS report that groups on DQ_TYPE and we can therefore Pie Chart to show the distribution of which error DQ_00X codes are most prevalent.

Is this possible using straightforward toolbox functions? Or is this only available with complex Script tasks, etc.?

Upvotes: 1

Views: 96

Answers (1)

billinkc
billinkc

Reputation: 61249

Assuming I understand your problem, I would structure this as a series of columns added to the data flow via Derived Column transformation.

Assume I have inbound like this

SELECT col1, col2, col3, col4;

My business rules

  • col1 cannot contain nulls DQ_001
  • col2 must be greater than 5 DQ_002
  • col3 must be less than 3 DQ_003
  • col4 has no rules

From my source, I would add a Derived Column Component

New Column named Pass_DQ_001 as a boolean with an expression !isnull([col1])

New Column named Pass_DQ_002 as a boolean with an expression [col2] > 5

New Column named Pass_DQ_003 as a boolean with an expression [col3] < 3

etc

At this point, your data row could look something like

NULL, 4, 4, No Rules, False, False, False
ABC, 7, 2, Still No Rules, True, True, True
...

If you have more than 3 to 4 data quality conditions, I'd add a final Derived Column component into the mix

New column IsValid as yet another boolean with an expression like Pass_DQ_001 && Pass_DQ_002 && Pass_DQ_003 etc

The penalty for adding additional columns is trivial compared to trying to debug complex expressions in a dataflow so don't do it - especially for bit columns.

At this point, you can put a data viewer in there and verify that yes, all my logic is correct. If it's wrong, you can zip in and figure out why DQ_036 isn't flagging correctly.

Otherwise, you're ready to then connect the data flow to a Conditional Split. Use our final column IsValid and things that match that go out the Output 1 path and the default/unmatched rows head to your "needs attention/failed validation" destination.

Upvotes: 1

Related Questions