Reputation: 349
I'm working with a double-pipe-delimited file which I want to ingest into Azure SQL DB via Data Factory (v2):
Column1||Column2||Column3
In my input dataset, I specified the double pipe delimiter as columnDelimiter in the format section of typeProperties:
"format": {
"type": "TextFormat",
"rowDelimiter": "\n",
"columnDelimiter": "||",
"skipLineCount": 2
}
Unfortunately, this is not working. All rows are marked as "incompatible" and no data is ingested into my SQL table. I then tried a few things. When not specifying any columnDelimiter, the complete row is ingested into the first column of my SQL table, so no delimiter is considered at all. This is not correct; logically, each data item must end up in a separate column. Writing the || as Unicode escapes also did not fix it. When I convert my input file to single-pipe-delimited, it works fine. However, in the end state I'm not able to alter the input file, so handling the double pipe delimiter is essential.
Upvotes: 2
Views: 3395
Reputation: 14379
You could either ask the creators of your input file to use a single delimiter (like most people do) or else you have to do some pre-processing. For example, you could do this with a U-SQL step which corrects the file, and then import the corrected file into your SQL DB, e.g. some simple U-SQL:
// Multiple column delimiters
@input =
    EXTRACT
        col1 string
    FROM "/input/input91.txt"
    USING Extractors.Text(delimiter : '~', skipFirstNRows : 1);

// Do some other processing here?
@output =
    SELECT col1.Replace("||", "|") AS col2
    FROM @input;

OUTPUT @output
TO "/output/output.txt"
USING Outputters.Text(delimiter : '|', quoting : false);
NB Here I've used a delimiter (tilde "~") which I know won't occur in the input file, so the whole row is effectively imported as one column, which can then be fixed up.
Is there a particular reason you use two delimiters?
Also, if your file is on blob storage, then you can now BULK INSERT into Azure SQL DB, and having just given this a quick test, BULK INSERT can cope with a multi-character delimiter, e.g.
DROP TABLE IF EXISTS #tmp;

CREATE TABLE #tmp (
    a VARCHAR(50),
    b VARCHAR(50),
    c VARCHAR(50)
);
GO

BULK INSERT #tmp
FROM 'D:\Data Lake\USQLDataRoot\input\input91.txt'
WITH ( FIELDTERMINATOR = '||', DATA_SOURCE = '<yourDataSource>' );
See the BULK INSERT documentation for more info on the setup steps, which involve creating an external data source and a credential. You could then have Data Factory execute it via a Stored Procedure activity, for example.
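A minimal sketch of that setup, assuming the file sits in a blob container you can reach with a SAS token; the credential, data source, procedure and table names below are placeholders, not part of the original answer:

-- Sketch only: placeholder names and secrets; run once in the target Azure SQL DB.
-- A database master key is needed before a database scoped credential can be created.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strongPassword>';

-- SAS token for the container holding the input file (omit the leading '?').
CREATE DATABASE SCOPED CREDENTIAL MyBlobCredential
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET = '<sasToken>';

-- External data source pointing at the container; BULK INSERT paths are then relative to it.
CREATE EXTERNAL DATA SOURCE MyBlobDataSource
WITH ( TYPE = BLOB_STORAGE,
       LOCATION = 'https://<storageAccount>.blob.core.windows.net/<container>',
       CREDENTIAL = MyBlobCredential );

-- Wrap the load in a procedure so a Data Factory Stored Procedure activity can call it.
CREATE OR ALTER PROCEDURE dbo.LoadDoublePipeFile
AS
BEGIN
    BULK INSERT dbo.MyTargetTable               -- placeholder target table
    FROM 'input/input91.txt'                    -- path relative to the data source location
    WITH ( FIELDTERMINATOR = '||', DATA_SOURCE = 'MyBlobDataSource' );
END;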
Upvotes: 1
Reputation: 23782
Based on the rules for the TextFormat dataset, only a single character is allowed as the columnDelimiter. I even tried to work around it with an escape delimiter, but that is not allowed either.
So it seems you need to process your data before the copy activity. Since ADF supports copying data from an HTTP endpoint, I suggest the following steps:
1. Expose your source file through an HTTP endpoint (for example, a small service that reads the file).
2. Replace all the '||' with '|' and return the result in the response (don't forget the '\n').
3. In ADF, set the column delimiter to '|' and copy the data from the HTTP endpoint.
This solution can handle large amounts of data. Of course, if you do not want to do such complicated work, you could process your data directly and save it as a temporary file for ADF.
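If you go the temporary-file route, a minimal sketch of the pre-processing could look like this in Python (file names here are just examples, not from the original post):

# Rewrite '||' to '|' so ADF can use a single-character column delimiter.
# Input/output file names are placeholders.
with open("input91.txt", "r", encoding="utf-8") as src, \
     open("input91_singlepipe.txt", "w", encoding="utf-8") as dst:
    for line in src:                          # iterating by line keeps the '\n' row delimiter intact
        dst.write(line.replace("||", "|"))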
Hope it helps you.
Upvotes: 1