Reputation: 61
I am using the following dataset for the purposes of testing Mapping Data flows: https://github.com/fivethirtyeight/data/tree/master/avengers.
My data flow is very simple: a source (delimited file on AzureDataLakeStorageGen2) moving to a sink (Parquet file on AzureDataLakeStorageGen2). I have used the Import projection capability to get my schema and enabled "Validate schema". I get a projection that assumes what appear to be the correct datatypes. It is worth noting that my dataset pointing to this source specifies all the columns as strings, so there is clearly a disconnect between the schema imported at the dataset and that of the source Data flow transformation. (Image: Dataflow Source Projection.)
If I attempt to use the "Validate Schema" option I run into problems. The first issue is an error on any datatype that is not a string, whether using "Data preview" or running the data flow in a pipeline, for example: Error: at Source 'AvengersHeader': Column 'Appearances has incompatible types( Found: StringType, Required: ShortType)
I had assumed this was due to the header row; however, excluding it using Skip line count with a value of 1 gives me a new error, e.g. Error: at Source 'AvengersHeader': Missing column 'URL.
The Source documentation states
Validate schema: If Validate schema is selected, the data flow will fail to run if the incoming source data doesn't match the defined schema of the dataset.
I had assumed that this would look at datatypes, but now wonder whether it simply checks which columns are supplied.
For the sink, the documentation states
Validate schema: If validate schema is selected, the data flow will fail if any column of the incoming source schema isn't found in the source projection, or if the data types don't match. Use this setting to enforce that the source data meets the contract of your defined projection. It's useful in database source scenarios to signal that column names or types have changed.
I note a couple of things: the sink text mentions both column names and types, and it refers back to the source projection rather than the sink's own schema.
My question, then: can anyone point to specific documentation that details the expected behavior and usage of this? Either official MS documentation or blogs, I'm not fussy. If anyone wants to throw their 2 cents/pennies in, that's fine too.
Looking at this further, I have explored a Derived Column followed by a Conditional Split, but this is potentially quite time consuming:
```
iif(
    isNull(toInteger(Appearances))
    || isNull(toBoolean(case(or({Current?} == '', isNull({Current?})), 'NO', {Current?})))
    || isNull(toShort(Year))
    || isNull(toShort({Years since joining}))
    || isNull(toBoolean(case(or(Death1 == '', isNull(Death1)), 'NO', Death1)))
    || isNull(toBoolean(case(or(Return1 == '', isNull(Return1)), 'NO', Return1)))
    || isNull(toBoolean(case(or(Death2 == '', isNull(Death2)), 'NO', Death2)))
    || isNull(toBoolean(case(or(Return2 == '', isNull(Return2)), 'NO', Return2)))
    || isNull(toBoolean(case(or(Death3 == '', isNull(Death3)), 'NO', Death3)))
    || isNull(toBoolean(case(or(Return3 == '', isNull(Return3)), 'NO', Return3)))
    || isNull(toBoolean(case(or(Death4 == '', isNull(Death4)), 'NO', Death4)))
    || isNull(toBoolean(case(or(Return4 == '', isNull(Return4)), 'NO', Return4)))
    || isNull(toBoolean(case(or(Death5 == '', isNull(Death5)), 'NO', Death5)))
    || isNull(toBoolean(case(or(Return5 == '', isNull(Return5)), 'NO', Return5)))
    , 1, 0
)
```
The Conditional Split transformation then writes compliant rows to the sink and routes malformed rows elsewhere.
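For comparison, the same cast-and-check idea is only a few lines in PySpark. This is just a sketch under assumptions: the file path is hypothetical, only the numeric columns are checked, and the boolean Death/Return handling is omitted for brevity:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Read every column as a string, mirroring the delimited-text dataset.
df = spark.read.option("header", True).csv("avengers.csv")

# Columns the projection expects to be numeric (names from the dataset).
numeric_cols = ["Appearances", "Year", "Years since joining"]

# Flag a row as malformed when a column has a value that fails the cast;
# cast() returns null for unparseable strings.
is_bad = F.lit(False)
for c in numeric_cols:
    is_bad = is_bad | (F.col(c).isNotNull() & F.col(c).cast("int").isNull())

flagged = df.withColumn("malformed", is_bad)
good_rows = flagged.filter(~F.col("malformed")).drop("malformed")
bad_rows = flagged.filter(F.col("malformed")).drop("malformed")
```

Empty fields come back as null from Spark's CSV reader by default, so only genuinely unparseable values trip the flag.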
While this works, I'm not sure I would do it this way; a dataframe using _corrupt_record feels like it would solve my problem much more easily, which poses the question: why would I attempt to validate in Mapping Data Flows at all?
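To make that concrete, here is a minimal sketch of the _corrupt_record approach, assuming a local avengers.csv and an abbreviated schema (the real one must list every column in file order). PERMISSIVE is Spark's default CSV mode; the cache() call is needed because Spark refuses queries that reference only the internal corrupt-record column:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import (IntegerType, StringType,
                               StructField, StructType)

spark = SparkSession.builder.getOrCreate()

# The expected schema, declared up front, plus an extra string column
# that captures the raw text of any row that fails to parse.
schema = StructType([
    StructField("URL", StringType()),
    StructField("Name/Alias", StringType()),
    StructField("Appearances", IntegerType()),
    # ... the remaining avengers.csv columns go here, in file order ...
    StructField("_corrupt_record", StringType()),
])

df = (spark.read
      .option("header", True)
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .schema(schema)
      .csv("avengers.csv"))

# Cache first: Spark disallows filtering on the corrupt-record
# column alone against the raw file scan.
df.cache()
good = df.filter(F.col("_corrupt_record").isNull()).drop("_corrupt_record")
bad = df.filter(F.col("_corrupt_record").isNotNull())
```

Rows whose values fail the declared types arrive with the raw line preserved in _corrupt_record, giving the same good/bad split as the Derived Column and Conditional Split with far less expression code.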
Upvotes: 0
Views: 1467
Reputation: 8670
The error Error: at Source 'AvengersHeader': Missing column 'URL occurs because you skipped one line, so the source can't find a column named URL. You need to check the 'First row as header' option in the connection settings of the dataset instead of skipping one line.
The 'Validate schema' option in the source compares the projection with the schema of your dataset. If a column or its type isn't the same, the data flow will fail.
So in your situation, I suggest you leave the 'Validate schema' option unchecked, and the flow will work fine.
Upvotes: 0