Reputation: 153
I have around 20 tab-separated input files. They have in the region of 500 columns, but each will be slightly different.
The sink output schema is known and will contain all the possible input columns.
As a simplified example:
File 1
Name | Age | DOB | Nationality |
---|---|---|---|
Bob | 21 | 01/01/1972 | British |
File2
Name | Nationality | NINO |
---|---|---|
Joe | British | AA995654A |
File 3
Name | DOB | Nationality |
---|---|---|
Sam | 01/01/1990 | British |
Is it possible to have one DataFlow with multiple inputs, where the schema is not known until runtime, that would cope with changes in the input files and in this case would output:
Name | Age | DOB | NINO | Nationality |
---|---|---|---|---|
Bob | 21 | 01/01/1972 | NULL | British |
Joe | NULL | NULL | AA995654A | British |
Sam | NULL | 01/01/1990 | NULL | British |
I have looked at column pattern matching and schema drift, but don't see how/if it is possible to achieve this.
Upvotes: 0
Views: 680
Reputation: 3838
What you will do is to build a logical model in your data flow using a Derived Column with the common model that you wish to conform your input data to. This video shows an example of achieving this: https://www.youtube.com/watch?v=K5tgzLjEE9Q
Upvotes: 1