DataSteve

Reputation: 61

"Validate Schema" In Source and Sink

I am using the following dataset for the purposes of testing Mapping Data flows: https://github.com/fivethirtyeight/data/tree/master/avengers.

My data flow is very simple: a Source (delimited file on AzureDataLakeStorageGen2) moving to a Sink (Parquet file on AzureDataLakeStorageGen2). I have used the Import Projection capability to get my schema and set the "Validate Schema" option. I get a projection that assumes what appear to be the correct datatypes. It is worth noting that the dataset pointing to this source specifies all the columns as strings, so there is clearly a disconnect between the schema imported at the dataset and that of the source Data flow transformation (image: Dataflow Source Projection).

If I attempt to use the "Validate Schema" option I run into problems. The first issue is an error on any datatype that is not a string when using "Data preview" or running the data flow in a pipeline, for example: Error: at Source 'AvengersHeader': Column 'Appearances' has incompatible types (Found: StringType, Required: ShortType). I had assumed this was due to the header row; however, excluding it using Skip line count with a value of 1 gives me a new error, e.g. Error: at Source 'AvengersHeader': Missing column 'URL'.

The Source documentation states:

Validate schema: If Validate schema is selected, the data flow will fail to run if the incoming source data doesn't match the defined schema of the dataset.

I had assumed that this would look at datatypes, but now wonder if it simply checks which columns are supplied.
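For comparison, the datatype-level check I was expecting is roughly what Spark (which Mapping Data Flows runs on) does with a FAILFAST read against an explicit schema. A minimal PySpark sketch; the storage path and the trimmed-down column list are assumptions for illustration, not my actual setup:

```
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, ShortType, IntegerType

spark = SparkSession.builder.getOrCreate()

# Hypothetical trimmed projection; a real read must declare every column in the file
schema = StructType([
    StructField("URL", StringType()),
    StructField("Appearances", IntegerType()),
    StructField("Year", ShortType()),
])

# FAILFAST aborts the read on the first row that can't be cast to the declared
# schema, i.e. the datatype-level validation I expected "Validate schema" to do
df = (spark.read
      .option("header", True)
      .option("mode", "FAILFAST")
      .schema(schema)
      .csv("abfss://container@account.dfs.core.windows.net/avengers/avengers.csv"))
```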

For the sink, the documentation states:

Validate schema: If validate schema is selected, the data flow will fail if any column of the incoming source schema isn't found in the source projection, or if the data types don't match. Use this setting to enforce that the source data meets the contract of your defined projection. It's useful in database source scenarios to signal that column names or types have changed.

I note a couple of things:

  1. This specifies changes in the source.
  2. It states that the data flow fails if the source data types don't match; however, that does not appear to happen in practice.

My question, then, is: can anyone point to specific documentation that details the expected behavior and usage of this? Either official MS documentation or blogs; I'm not fussy. If anyone wants to throw their 2 cents/pennies in, that's fine too.

Looking at this further, I have explored a Derived Column followed by a Conditional Split, but this is potentially quite time consuming:

```
iif(
    isNull(toInteger(Appearances))
||  isNull(toBoolean(case(or({Current?} == '', isNull({Current?})), 'NO', {Current?})))
||  isNull(toShort(Year))
||  isNull(toShort({Years since joining}))
||  isNull(toBoolean(case(or(Death1 == '', isNull(Death1)), 'NO', Death1)))
||  isNull(toBoolean(case(or(Return1 == '', isNull(Return1)), 'NO', Return1)))
||  isNull(toBoolean(case(or(Death2 == '', isNull(Death2)), 'NO', Death2)))
||  isNull(toBoolean(case(or(Return2 == '', isNull(Return2)), 'NO', Return2)))
||  isNull(toBoolean(case(or(Death3 == '', isNull(Death3)), 'NO', Death3)))
||  isNull(toBoolean(case(or(Return3 == '', isNull(Return3)), 'NO', Return3)))
||  isNull(toBoolean(case(or(Death4 == '', isNull(Death4)), 'NO', Death4)))
||  isNull(toBoolean(case(or(Return4 == '', isNull(Return4)), 'NO', Return4)))
||  isNull(toBoolean(case(or(Death5 == '', isNull(Death5)), 'NO', Death5)))
||  isNull(toBoolean(case(or(Return5 == '', isNull(Return5)), 'NO', Return5)))
    , 1, 0
)
```

The Conditional Split transformation then routes rows that comply to the sink and writes malformed rows elsewhere.
While this works, I'm not sure I would do it this way; a dataframe using _corrupt_record feels like it would solve my problem much more easily, which poses the question: why would I attempt to validate in Mapping Data Flows at all?
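For reference, a minimal sketch of what I mean by the _corrupt_record approach (path and column list assumed, as above); rows that fail to parse against the declared schema land in the corrupt-record column and can be routed separately:

```
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType, ShortType, IntegerType

spark = SparkSession.builder.getOrCreate()

# Hypothetical trimmed projection; _corrupt_record captures rows that don't parse
schema = StructType([
    StructField("URL", StringType()),
    StructField("Appearances", IntegerType()),
    StructField("Year", ShortType()),
    StructField("_corrupt_record", StringType()),
])

df = (spark.read
      .option("header", True)
      .option("mode", "PERMISSIVE")  # the default: bad rows become nulls + _corrupt_record
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .schema(schema)
      .csv("abfss://container@account.dfs.core.windows.net/avengers/avengers.csv"))

df.cache()  # Spark disallows filtering on the corrupt-record column alone without caching
good = df.filter(col("_corrupt_record").isNull()).drop("_corrupt_record")
bad  = df.filter(col("_corrupt_record").isNotNull())
```

That is a couple of lines of configuration, versus a hand-written cast check per column in the expression above.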

Upvotes: 0

Views: 1467

Answers (1)

Steve Johnson

Reputation: 8670

The error Error: at Source 'AvengersHeader': Missing column 'URL' occurs because you skip one line, so the column named URL can't be found. You need to check the 'First row as header' option in the connection settings of the dataset instead of skipping one line.
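By way of analogy, the same distinction shows up in a plain Spark CSV read (path assumed for illustration): the header option both consumes the first line and supplies the column names, whereas merely dropping the line leaves the columns unnamed, so a lookup of URL fails:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
path = "abfss://container@account.dfs.core.windows.net/avengers/avengers.csv"  # assumed path

# header=True consumes the first line AND uses it for column names
with_header = spark.read.option("header", True).csv(path)

# header=False leaves generic names _c0, _c1, ..., so selecting 'URL' would fail
without_header = spark.read.option("header", False).csv(path)
```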

The 'Validate schema' option in the source compares the projection with the schema of your dataset. If a column or its type isn't the same, the data flow will fail.


So in your situation, I suggest you leave the 'Validate schema' option unchecked; the flow should then work fine.


Upvotes: 0
