Reputation: 133
I thought this will be trivial but I can't get this to work.
Assume a line in a CSV file:
"Barack Obama", 48, "President", "1600 Penn Ave, Washington DC"
string[] tokens = line.split(',')
I expect this:
"Barack Obama"
48
"President"
"1600 Penn Ave, Washington DC"
but the last token is
'Washington DC'
not
"1600 Penn Ave, Washington DC"
.
Is there an easy way to get the split function to ignore the comma within quotes?
I have no control over the CSV file and it doesn;t get sent to me. Customer A will be using the app to read files provided by an external individual.
Upvotes: 13
Views: 7516
Reputation: 19423
You might have to write your own split function.
"
character, toggle a booleanHere's an example:
public static class StringExtensions
{
public static string[] SplitQuoted(this string input, char separator, char quotechar)
{
List<string> tokens = new List<string>();
StringBuilder sb = new StringBuilder();
bool escaped = false;
foreach (char c in input)
{
if (c.Equals(separator) && !escaped)
{
// we have a token
tokens.Add(sb.ToString().Trim());
sb.Clear();
}
else if (c.Equals(separator) && escaped)
{
// ignore but add to string
sb.Append(c);
}
else if (c.Equals(quotechar))
{
escaped = !escaped;
sb.Append(c);
}
else
{
sb.Append(c);
}
}
tokens.Add(sb.ToString().Trim());
return tokens.ToArray();
}
}
Then just call:
string[] tokens = line.SplitQuoted(',','\"');
Results of benchmarking my code and Dan Tao's code are below. I'm happy to benchmark any other solutions if people want them?
Code:
string input = "\"Barak Obama\", 48, \"President\", \"1600 Penn Ave, Washington DC\""; // Console.ReadLine()
string[] tokens = null;
// run tests
DateTime start = DateTime.Now;
for (int i = 0; i < 1000000; i++)
tokens = input.SplitWithQualifier(',', '\"', false);
Console.WriteLine("1,000,000 x SplitWithQualifier = {0}ms", DateTime.Now.Subtract(start).TotalMilliseconds);
start = DateTime.Now;
for (int i = 0; i<1000000;i++)
tokens = input.SplitQuoted(',', '\"');
Console.WriteLine("1,000,000 x SplitQuoted = {0}ms", DateTime.Now.Subtract(start).TotalMilliseconds);
Output:
1,000,000 x SplitWithQualifier = 8156.25ms
1,000,000 x SplitQuoted = 2406.25ms
Upvotes: 15
Reputation: 2714
You should be using Microsoft.VisualBasic.FileIO.TextFieldParser
for that. It will handle all the CSV stuff correctly for you, see: A similar question with example using the TextFieldParser
PS: Do not fear using the Microsoft.VisualBasic dll in a C# project, it's all .NET :-)
Upvotes: 1
Reputation: 128317
I have a SplitWithQualifier
extension method that I use here and there, which utilizes Regex
.
I make no claim as to the robustness of this code, but it has worked all right for me for a while.
// mangled code horribly to fit without scrolling
public static class CsvSplitter
{
public static string[] SplitWithQualifier(this string text,
char delimiter,
char qualifier,
bool stripQualifierFromResult)
{
string pattern = string.Format(
@"{0}(?=(?:[^{1}]*{1}[^{1}]*{1})*(?![^{1}]*{1}))",
Regex.Escape(delimiter.ToString()),
Regex.Escape(qualifier.ToString())
);
string[] split = Regex.Split(text, pattern);
if (stripQualifierFromResult)
return split.Select(s => s.Trim().Trim(qualifier)).ToArray();
else
return split;
}
}
Usage:
string csv = "\"Barak Obama\", 48, \"President\", \"1600 Penn Ave, Washington DC\"";
string[] values = csv.SplitWithQualifier(',', '\"', true);
foreach (string value in values)
Console.WriteLine(value);
Output:
Barak Obama
48
President
1600 Penn Ave, Washington DC
Upvotes: 14
Reputation: 2041
I would recommend using a regular expression instead. It will allow you to extract more complicated substrings in a much more versatile manner (precisely as you want).
http://www.c-sharpcorner.com/uploadfile/prasad_1/regexppsd12062005021717am/regexppsd.aspx
http://oreilly.com/windows/archive/csharp-regular-expressions.html
Upvotes: 0
Reputation: 1429
string temp = line.Replace( "\"", "" );
string[] tokens = temp.Split(',')
Upvotes: -2
Reputation: 4971
I see from the bigger picture that you are actually trying to parse CSV input. So instead of advising on how to split the string properly, I would instead recommend you to use a CSV parser to do this kind of thing.
One that I would recommend is the library (source code available) that you can get from this CodeProject page: http://www.codeproject.com/KB/database/CsvReader.aspx
I personally use it myself and like it. It's a .NET native code and a lot faster than using OLEDB (which also can do the CSV parsing for you, but believe me, it's slow).
Upvotes: 5
Reputation: 1202
That would be the expected behavior as quotes are just another string character in C#. Looks like what you are after is the quoted tokens or numeric tokens.
I think you might need to use Regex to split the strings unless some one else knows a better way.
Or you could just loop through the string one character at a time building up the string as you go and build the tokens that way. It's old school but may be the most reliable way in your case.
Upvotes: 0
Reputation: 75296
You can't parse a CSV line with a simple Split on commas, because some cell contents will contain commas that aren't meant to delineate data but are actually part of the cell contents themselves.
Here is a link to a simple regex-based C# method that will convert your CSV into a handly DataTable
:
http://www.hotblue.com/article0000.aspx?a=0006
Working with DataTables is very easy - let me know if you need a code sample for that.
Upvotes: 0
Reputation: 532
Can't you change how the CSV is generated? Using OpenOffice, you can set the char separator (use ;) and how the string is delimited (using " or ').
It would be like this: 'President';'1600 Penn Ave, Washington DC'
Upvotes: -1