Noxville

Reputation: 53

Can I create SQL tables using Azure Data Factory with a dynamic schema?

I'm trying to use Azure Data Factory to take CSVs and turn them into SQL tables in the DW.

The columns will change often, so it needs to pick up each CSV's schema dynamically.

I've tried using Get Metadata to get the structure and data types, but I'm unable to parse its output into the format needed to create the SQL table.

Has anyone done anything like this in ADF? Is it possible?

Upvotes: 5

Views: 12457

Answers (1)

Joel Cochran

Reputation: 7758

Yes - it takes a bit of configuration, but you can accomplish this with Azure Data Factory Data Flow (ADFDF).

  1. Create a DataSet pointing to your CSV location (I'm assuming Azure Blob Storage).

    • Initially, select a specific CSV file.
    • On the Schema tab, click "Import schema". It is OK that this will change later, but the DataSet must have a schema at design time.
    • On the Parameters tab, create a parameter for the blobName.
    • On the Connection tab, reference that parameter in the "File" box. You will set its value in the pipeline at runtime. [This overrides the initial value used to define the schema]. A JSON sketch of a DataSet parameterized this way appears after this list.
  2. Create a DataSet for the SQLDW table.

    • Select "Create new table"
    • Add the schema and table names [this should be configurable/overrideable later via DataSet parameters if needed]
    • The Schema tab will show no Schema.
  3. Create a DataFlow to move the data from CSV to SQLDW.

    • SOURCE: select the DataSet created in step 1.
      • On the Source Settings tab: Make sure "Allow schema drift" is checked and "Validate schema" is unchecked [These are the default settings].
      • CHECK "Infer drifted column types", which is NOT the default.
    • SINK: select the DataSet created in step 2.
      • On the Sink tab: Make sure "Allow schema drift" is checked and "Validate schema" is unchecked [These are the default settings].
      • On the Settings tab, change "Table action" to "Recreate table". This should infer the new schema and drop and create the columns based on what it finds.
      • On the Mappings tab: make sure "Auto Mapping" is enabled [should be by default]
  4. In the Pipeline:

    • Create a parameter for "blobName"
    • Select the Data Flow activity:
      • On the Settings tab: set the source parameter for blobName to the pipeline parameter you just created (see the pipeline activity sketch after this list).
      • SQLDW specific: you will need to provide a Blob Storage Linked Service and location for Polybase.
  5. CAVEATS

    • From what I've seen, every column in the SQLDW table is created as NVARCHAR(MAX). I thought the "Infer drifted column types" would address this, but apparently not.
    • This configuration assumes that the first row of the CSV is a header row.
    • The Sink operation will fail if the incoming column names in the header row contain spaces or special characters. To combat this in a production scenario, add a Select transformation between the Source and Sink in the Data Flow, then use the new rule-based mapping and expressions to strip out any invalid characters (see the expression sketch after this list).
    • My example uses the same SQLDW schema and table name every time, but as mentioned in step 2 above, you should be able to create DataSet parameters to override those at runtime if needed.
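
For reference, here's a minimal sketch of what the parameterized CSV DataSet from step 1 might look like in the underlying JSON. The names (CsvBlobDataset, AzureBlobStorageLS, the "incoming" container) are placeholders, and the imported schema columns are omitted; the JSON your instance generates may differ slightly:

```json
{
    "name": "CsvBlobDataset",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "AzureBlobStorageLS",
            "type": "LinkedServiceReference"
        },
        "parameters": {
            "blobName": { "type": "string" }
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "incoming",
                "fileName": {
                    "value": "@dataset().blobName",
                    "type": "Expression"
                }
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": true
        }
    }
}
```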
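
Similarly, here's a rough sketch of how the Data Flow activity from step 4 might appear in the pipeline JSON, showing the pipeline's blobName parameter flowing through to the source DataSet's parameter, plus the Blob Storage staging location for Polybase. The activity, data flow, source stream, and linked service names are illustrative, and I believe the dataset parameters serialize under datasetParameters on the DataFlowReference, but verify against the JSON your own pipeline generates:

```json
{
    "name": "LoadCsvToSqlDw",
    "type": "ExecuteDataFlow",
    "typeProperties": {
        "dataflow": {
            "referenceName": "CsvToSqlDwDataFlow",
            "type": "DataFlowReference",
            "datasetParameters": {
                "csvSource": {
                    "blobName": {
                        "value": "@pipeline().parameters.blobName",
                        "type": "Expression"
                    }
                }
            }
        },
        "staging": {
            "linkedService": {
                "referenceName": "AzureBlobStagingLS",
                "type": "LinkedServiceReference"
            },
            "folderPath": "polybase-staging"
        }
    }
}
```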

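For the column-name cleanup mentioned in the caveats, a minimal sketch of a rule-based mapping in that Select transformation might look like the following, where $$ refers to the name of each matched column and the regex is just one possible cleanup rule:

```
Matching condition:  true()
Name as:             regexReplace($$, '[^0-9a-zA-Z_]', '')
```

This matches every incoming column and renames it by stripping out anything that isn't a letter, digit, or underscore.
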
Upvotes: 8
