Snow

Reputation: 11

How to use Validate schema in a mapping data flow whose source is a parameterized SAP CDC dataset

We have a requirement that the schema of the dataset in our destination (ADLS) must not change when there is a schema change in the source (SAP BW).

I have created a mapping data flow whose source uses an SAP CDC dataset with ODPName, ODPContext, loadType, and runMode parameterized.

I disabled the "Allow schema drift" option and tried two things: first, enabling "Validate schema"; second, disabling all the options. But when I went to validate or publish the change, it threw this error:

Schema should be defined in the dataset or enable allow schema drift

So I went to define a schema in the dataset, but there is no schema option in the SAP CDC dataset.

Could anyone please help me here? I am stuck on this issue.

Also, the projection option does not help because the pipeline is parameterized and the ODPName and ODPContext are only known at runtime. If I am missing something, please let me know.

Upvotes: 0

Views: 58

Answers (1)

Rakesh Govindula

Reputation: 11454

The Validate schema option will not work with a parameterized, dynamically changing schema. It is meant for the case where you have imported the projection and the data changes later, during the data flow run.

In this case, you can use an Assert transformation to fail the data flow. Follow the data flow design below.

After all of your transformations on the source, add a Derived Column transformation, create a new column source_col_names, and give it the expression columnNames(), as in the sketch below.

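A rough data flow script sketch of that step (sourceSAP is a placeholder for your SAP CDC source stream, and the derived column is named derivedColumn2 here to match the Assert step described later; your transformation names may differ):

sourceSAP derive(source_col_names = columnNames()) ~> derivedColumn2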

Now, add another source and point it to your target ADLS dataset. Similarly, add another Derived Column transformation on it and create a new column target_col_names with the same columnNames() expression.

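A similar sketch for the target branch (targetADLS and derivedColumn1 are again placeholder names; the exact source properties depend on your dataset):

source(allowSchemaDrift: true,
	validateSchema: false) ~> targetADLS
targetADLS derive(target_col_names = columnNames()) ~> derivedColumn1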

Now, compare these two array columns using an Assert transformation condition, as described below.

On derivedColumn2, add an Assert transformation and add derivedColumn1 under Additional streams. Check Fail data flow and select the Expect exists assert type. Give any description, and in the expression compare source_col_names and target_col_names from the two streams.

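With derivedColumn2 as the incoming stream and derivedColumn1 added as the additional stream, the Expect exists condition in the expression builder would be along these lines (the stream@column syntax refers to a column coming from the additional stream):

source_col_names == derivedColumn1@target_col_names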

If the first row of source_col_names from derivedColumn2 (i.e., the column-names array of the source) does not exist in the target column-names array, the data flow will fail. Otherwise, the data goes on to the next transformation.

Next, use a Select transformation with rule-based mapping to remove the extra columns that were created, using a matching condition like the one below.

(name!='target_col_names')&&(name!='source_col_names')

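In data flow script form, that rule-based mapping would look roughly like this (assert1 and select1 are placeholder stream names):

assert1 select(mapColumn(
		each(match((name!='target_col_names')&&(name!='source_col_names')))
	),
	skipDuplicateMapInputs: true,
	skipDuplicateMapOutputs: true) ~> select1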

Add your target dataset again in the sink. If the schema is the same, the source data will be copied to the target file; otherwise, the data flow will fail at the Assert transformation.

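And the sink step, sketched in the same script style (sink1 is a placeholder name; the dataset is the same target ADLS dataset):

select1 sink(allowSchemaDrift: false,
	validateSchema: false) ~> sink1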

Upvotes: 0
