Reputation: 117
I have a CSV file with over 20 million rows, delimited by the vertical bar. The problem is that one text column in the file also contains vertical bars within its values, which breaks the layout and shifts data into the next column when importing the file into SQL Server.
The file is too big to open in a text editor, so we cannot simply add a qualifier or change the delimiter, even with advanced editors.
Any ideas? Ideally, is there a general solution for issues like this? Even when qualifiers are used, text fields can still contain qualifier-like strings, delimiters, etc.
The fields are not quoted. The rows look simply like this:
field1|field2|field3|field4
1|000|some text|some text
2|001|some text con|taining pipe|some text
3|002|some text|some text
Upvotes: 2
Views: 1741
Reputation: 44911
With access to bash (Linux/Unix/Cygwin, etc.):
To estimate the severity of the issue, count the records that have exactly 4 fields versus those with any other number of fields.
awk -F'|' '{rec[NF==4?"NF=4":"NF!=4"]++}END{for(nf in rec){print nf,rec[nf]}}' MyFile.csv
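For instance, feeding the sample rows from the question through the same command (a hypothetical run; note that the header line also counts as a 4-field record):

```shell
# Count the field distribution on the sample data from the question.
# The rows are fed inline here; in practice point awk at MyFile.csv instead.
printf '%s\n' \
  'field1|field2|field3|field4' \
  '1|000|some text|some text' \
  '2|001|some text con|taining pipe|some text' \
  '3|002|some text|some text' |
awk -F'|' '{rec[NF==4?"NF=4":"NF!=4"]++}END{for(nf in rec){print nf,rec[nf]}}'
# The header and two data rows have 4 fields; the row with the embedded
# pipe has 5, so the counts are NF=4 -> 3 and NF!=4 -> 1
# (the output order of the two lines may vary between awk versions).
```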
Generate a file with the good records and load it.
awk -F'|' 'NF==4{print}' MyFile.csv > MyFile_good.csv
Generate a file with the bad records and check whether you can fix them manually or in some other way (if you can identify patterns).
awk -F'|' 'NF!=4{print}' MyFile.csv > MyFile_bad.csv
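One such pattern-based fix, sketched here as an assumption rather than part of the original answer: if the stray pipes only ever occur in the 3rd (text) column, and the text never contains TAB characters, then fields 3 through NF-1 of a bad row belong together, and the whole file can be re-emitted tab-delimited in one pass:

```shell
# Sketch, assuming stray pipes occur only in the 3rd column and the text
# contains no TABs: rejoin fields 3..NF-1 and re-emit the row tab-delimited
# so it imports cleanly. Input is fed inline; run it on MyFile.csv in practice.
printf '%s\n' '2|001|some text con|taining pipe|some text' |
awk -F'|' 'BEGIN{OFS="\t"} {
  t = $3
  for (i = 4; i < NF; i++) t = t "|" $i   # glue the split text column back together
  print $1, $2, t, $NF
}'
# Emits the row with TAB delimiters, keeping the pipe inside field 3.
```

Good rows (NF==4) pass through unchanged apart from the new delimiter, so this can also be applied to the whole file instead of splitting it first.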
"1"|"000"|"some text"|"some text"
"2"|"001"|"some text con|taining pipe"|"some text"
"3"|"002"|"some text"|"some text"
The rows above show the same data with quote qualifiers added. Instead of defining a separator (awk -F'|') we now define what a qualified field looks like (FPAT="\"[^\"]*\"").
awk 'BEGIN{OFS="\t";FPAT="\"[^\"]*\""}{rec[NF==4?"NF=4":"NF!=4"]++}END{for(nf in rec){print nf,rec[nf]}}' MyFile.csv
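Note that FPAT is a GNU awk (gawk) extension. The same field-pattern idea can be illustrated portably with grep -o, which extracts every non-overlapping match on its own line (an illustration of mine, not part of the original answer):

```shell
# Extract each quoted field of a sample row using the same pattern "[^"]*".
printf '%s\n' '"2"|"001"|"some text con|taining pipe"|"some text"' |
grep -o '"[^"]*"'
# prints each of the 4 quoted fields on its own line, including the one
# with the embedded pipe:
#   "2"
#   "001"
#   "some text con|taining pipe"
#   "some text"
```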
Upvotes: 1
Reputation: 9927
You can import each entire row into an NVARCHAR(MAX) column and then fix or parse it with T-SQL:
CREATE TABLE MyCSV (
csv NVARCHAR(MAX) NULL -- VARCHAR(MAX) NULL
)
GO
BULK INSERT MyCSV
FROM 'data_file'
WITH (
DATAFILETYPE = 'widechar' --'char'
,FIELDTERMINATOR = '\r\n'
)
-- OR WITH (FORMATFILE='C:\t_floatformat-c-xml.xml');
GO
/*
INSERT INTO MyCSV
VALUES
('1|000|some text|some text')
,('2|001|some text con|taining pipe|some text')
,('3|002|some text|some text')
*/
ALTER TABLE MyCSV
ADD RowID INT NOT NULL IDENTITY(1,1) PRIMARY KEY CLUSTERED
GO
For parsing you can use this function:
-- SELECT * FROM [dbo].[Split2Column] (N'1|000|some text|some text', N'|')
CREATE FUNCTION [dbo].[Split2Column] (
@String NVARCHAR(MAX),
@SepColumn NCHAR(1)
)
RETURNS @Columns TABLE (
[1] NVARCHAR(MAX)
,[2] NVARCHAR(MAX)
,[3] NVARCHAR(MAX)
,[4] NVARCHAR(MAX)
,[5] NVARCHAR(MAX)
,[6] NVARCHAR(MAX)
,[7] NVARCHAR(MAX)
,[8] NVARCHAR(MAX)
,[9] NVARCHAR(MAX)
,[10] NVARCHAR(MAX)
)
AS
BEGIN
;WITH columns (cn, n1, n2 ) AS (
SELECT CAST(1 as int) as cn, CAST(0 as bigint) as n1, CHARINDEX(@SepColumn, @String + @SepColumn) as n2
UNION ALL
SELECT cn + 1, n2 as n1, CHARINDEX(@SepColumn, @String + @SepColumn, n2 + 1) as n2
FROM columns
WHERE n2 < LEN(@String)
)
INSERT INTO @Columns
SELECT [1],[2],[3],[4],[5],[6],[7],[8],[9],[10]
FROM
(
SELECT cn,
SUBSTRING(@String, n1 + 1, n2 - n1 - 1) as val
FROM columns) parsed
PIVOT (
MIN(val) FOR cn IN ([1],[2],[3],[4],[5],[6],[7],[8],[9],[10])
) pvt
OPTION (MAXRECURSION 0);
RETURN;
END
GO
And here is the result:
SELECT MyCSV.RowID
,[Split2Column].*
FROM MyCSV
CROSS APPLY [dbo].[Split2Column] (MyCSV.csv, N'|')
--WHERE [Split2Column].[5] IS NOT NULL
ORDER BY MyCSV.RowID
RowID 1 2 3 4 5 6 7 8 9 10
1 1 000 some text some text NULL NULL NULL NULL NULL NULL
2 2 001 some text con taining pipe some text NULL NULL NULL NULL NULL
3 3 002 some text some text NULL NULL NULL NULL NULL NULL
Upvotes: 1