Peter_07
Peter_07

Reputation: 117

Delimiter within field value SQL Server

I have a csv file which has over 20 million rows, the delimiter is the vertical bar. The issue is that there is a text column in the file which include also vertical bars within the texts, this messes up the data and cause the column to shift to the next one when importing the csv file in the SQL Server.

The file is too big to handle, e.g. if we want to add a qualifier or change the delimiter type using even fancy text editors.

Any idea? Ideally, any general solution for issues like this? Sometimes although you are using qualifiers, there might be text fields containing qualifier-like strings, delimiters, etc..

The fields are not quoted. The rows look simply like this:

field1|field2|field3|field4  
1|000|some text|some text  
2|001|some text con|taining pipe|some text  
3|002|some text|some text  

Upvotes: 2

Views: 1741

Answers (2)

David דודו Markovitz
David דודו Markovitz

Reputation: 44911

With access to bash (Linux/Unix/Cygwin etc.)


In order to estimate the severity of the issue, check the number of records with 4 fields and with other numbers of fields .

awk -F'|' '{rec[NF==4?"NF=4":"NF!=4"]++}END{for(nf in rec){print nf,rec[nf]}}' MyFile.csv

Generate a file with the good records and load it.

awk -F'|' 'NF==4{print}' MyFile.csv > MyFile_good.csv

Generate a file with the bad records and check if you can fix it manually or some other way (If you identify patterns)

awk -F'|' 'NF!=4{print}' MyFile.csv > MyFile_bad.csv

Support for qualifiers

"1"|"000"|"some text"|"some text"  
"2"|"001"|"some text con|taining pipe"|"some text"  
"3"|"002"|"some text"|"some text"  

Instead of defining a separator (awk -F'|') we are now defining how a qualified field looks like (FPAT="\"[^\"]*\"")

awk 'BEGIN{OFS="\t";FPAT="\"[^\"]*\""}{rec[NF==4?"NF=4":"NF!=4"]++}END{for(nf in rec){print nf,rec[nf]}}' MyFile.csv

Upvotes: 1

AlexK
AlexK

Reputation: 9927

You can import entire row into NVARCHAR(MAX) column and fix or parse by T-SQL

CREATE TABLE MyCSV (
    csv NVARCHAR(MAX) NULL      -- VARCHAR(MAX) NULL
)
GO

BULK INSERT   MyCSV 
FROM 'data_file' 
WITH (
    DATAFILETYPE =   'widechar' --'char' 
    ,FIELDTERMINATOR = '\r\n'
)
-- OR WITH (FORMATFILE='C:\t_floatformat-c-xml.xml'); 
GO

/*
INSERT INTO MyCSV 
VALUES 
    ('1|000|some text|some text')
    ,('2|001|some text con|taining pipe|some text')
    ,('3|002|some text|some text')
*/

ALTER TABLE MyCSV 
    ADD RowID INT NOT NULL IDENTITY(1,1) PRIMARY KEY CLUSTERED
GO

For parsng you can use this function:

-- SELECT * FROM [dbo].[Split2Column] (N'1|000|some text|some text', N'|')
CREATE FUNCTION [dbo].[Split2Column] (  
    @String NVARCHAR(MAX),
    @SepColumn NCHAR(1)
)
RETURNS @Columns TABLE (
     [1] NVARCHAR(MAX)
    ,[2] NVARCHAR(MAX)
    ,[3] NVARCHAR(MAX)
    ,[4] NVARCHAR(MAX)
    ,[5] NVARCHAR(MAX)
    ,[6] NVARCHAR(MAX)
    ,[7] NVARCHAR(MAX)
    ,[8] NVARCHAR(MAX)
    ,[9] NVARCHAR(MAX)
    ,[10] NVARCHAR(MAX)
)

AS
BEGIN

    ;WITH columns (cn, n1, n2 ) AS ( 
    SELECT CAST(1 as int) as cn, CAST(0 as bigint) as n1, CHARINDEX(@SepColumn, @String + @SepColumn) as n2
    UNION ALL 
    SELECT cn + 1, n2 as n1, CHARINDEX(@SepColumn, @String + @SepColumn, n2 + 1) as n2
    FROM columns
    WHERE n2 < LEN(@String)
    )

    INSERT INTO @Columns
    SELECT [1],[2],[3],[4],[5],[6],[7],[8],[9],[10]
    FROM 
        (
        SELECT cn,
            SUBSTRING(@String, n1 + 1, n2 - n1 - 1) as val
        FROM columns) parsed
    PIVOT (
            MIN(val) FOR cn IN ([1],[2],[3],[4],[5],[6],[7],[8],[9],[10]) 
        ) pvt
        OPTION (MAXRECURSION 0);

    RETURN;
END
GO

And here result:

SELECT MyCSV.RowID
    ,[Split2Column].*
FROM MyCSV
    CROSS APPLY [dbo].[Split2Column] (MyCSV.csv, N'|')
--WHERE [Split2Column].[5] IS NOT NULL
ORDER BY MyCSV.RowID

RowID   1   2   3               4               5           6       7       8       9       10
1       1   000 some text       some text       NULL        NULL    NULL    NULL    NULL    NULL
2       2   001 some text con   taining pipe    some text   NULL    NULL    NULL    NULL    NULL
3       3   002 some text       some text       NULL        NULL    NULL    NULL    NULL    NULL

Upvotes: 1

Related Questions