Reputation: 21
I am facing a problem importing a flat file into SSIS.
The file is seperated by "|" and has deliminater as ";;". However the deliminator is inconsistent. Sometimes, at the and of the rows, there is only ";" or nothing "". When importing to SSIS I get the result
Column 1 Column 2 Column 3 Column 4 Column 5
a b c d e;|a1|b1|c1|d1|e1
This should instead look like
Column 1 Column 2 Column 3 Column 4 Column 5
a b c d e
a1 b1 c1 d1 e1
And the problem arrises because in the first row there is only one or none ";". Note this is an example, many of the rows are correct and have ";;" as deliminator. I am only pointing out the problem.
The .csv file would look like
Column 1|Column 2|Column 3|Column 4|Column 5;;
a|b|c|d|e;
a1|b1|c1|d1|e1;;
and should instead look like
Column 1|Column 2|Column 3|Column 4|Column 5;;
a|b|c|d|e;;
a1|b1|c1|d1|e1;;
The data set is very big with almost 600.000 rows and 50 columns.
The first problem I face is when I import the file, since SSIS standard DataType reading is string [DT_STR]
. with a length of 50
. Since sometimes there are multiple rows with wrong deliminators, I get a very long strings in the last column cell. I Use Visual Studio, and in the Advanced Editor I changed the length to something very big.
Advanced editor in Visual studio were I have changed the length
So the question is, how do I in SSIS and Visual Studio Community separate the values in some cells in one column and split op these into a entire new row (with the already defined column variables).
I have tried manually to find all the cases where there is a error and changed this in the .csv file. After this SSIS works. However this is not a durable solution because I am getting a new file every month.
I have tried reading suggestions as:
Split a single column of data with comma delimiters into multiple columns in SSIS
but their problem is not he same, since they have a column value the replicate, and I want a entire new row.
Thanks for any help, ss
!! EDIT trying using the answers from J Weezy and R M: !!
I try to create a script task and follow that solution.
In Visual Studio, I add a script task using a Script Component and I choose "Transformation". Under Input Columns I choose all.
After this i direct the flat file source to the script component and run the code. Running the script like this (where the script component doesn't do anything) works.
Then I enter "Edit Script" in the script component, and under public override void Input0_ProcessInputRow(Input0Buffer Row)
I enter (using the help from R M):
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
public static string[] SplitLine(string input)
{
Regex lineSplit = new Regex("[0-9]\;$", RegexOptions.Compiled);
List<string> list = new List<string>();
string curr = null;
foreach (Match match in lineSplit.Matches(input))
{
curr = match.Value;
if (0 == curr.Length)
{
list.Add("");
}
list.Add(curr.TrimStart(';'));
}
return list.ToArray();
}
}
However this doesn't work (I am not even allowed to execute the task).
I have never worked with c# before so everything is new to me. As i understand the code, it search each line to find the pattern where there is numbers in front of only one ";" at the end, hence it will not find those lines which ends with numbers following by ";;" (two ;). When there is a match, one ";" is added.
Please let me know, what I am not understanding and doing wrong.
Maybe it is also wrong to put the script component after the flat file source, because adding ";" will not result in a new line, which is what I want.
Upvotes: 1
Views: 580
Reputation: 3945
Inconsistent row delimiters is bad data and there really is no way to correct for this in either the connection manager or the data flow. Fixing bad data within the data flow is not what SSIS was designed for. Your best bet is to do one of the two following:
From there, you will be able to process the file normally in SSIS.
Update 1:
If the only problem is a duplicate delimiter (;;), then read in the row and use the Replace(";;",";");
function. If you have either multiple duplicate or invalid end of row delimiters, then you are better served by using StringBuilder()
. For a solution on using StringBuilder()
, see the weblink below.
https://stackoverflow.com/a/49949787/4630376
Update 2:
One thing that I just remembered, you will need to adjust for handling only those characters that are outside of double quotes, assuming that double quotes exist within the file as the text qualifier. This is important because without it you will remove any characters that are within quotes, which may be valid data.
Upvotes: 1
Reputation: 119
I would agree with J Weezy to create a script task to correct the bad data. In the script task you could possibly use regex to deal with the “;” and “;;” issue. The script task may be your only way of dealing with the the “;” and “;;” issue.
While the below code in its current form will not work for your case, it possibly could be changed to work for your case. I have used it to deal with processing a text\csv file to correct formatting issues with each line of data. Note I got this from another post on Stackoverflow.
public static string[] SplitLine(string input)
{
Regex lineSplit = new Regex("(?:^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)", RegexOptions.Compiled);
List<string> list = new List<string>();
string curr = null;
foreach (Match match in lineSplit.Matches(input))
{
curr = match.Value;
if (0 == curr.Length)
{
list.Add("");
}
list.Add(curr.TrimStart(','));
}
return list.ToArray();
}
Upvotes: 0