ritu
ritu

Reputation: 133

Split String in C#

I thought this will be trivial but I can't get this to work.

Assume a line in a CSV file: "Barack Obama", 48, "President", "1600 Penn Ave, Washington DC"

string[] tokens = line.split(',')

I expect this:

 "Barack Obama"
 48
 "President"
 "1600 Penn Ave, Washington DC"

but the last token is 'Washington DC' not "1600 Penn Ave, Washington DC".

Is there an easy way to get the split function to ignore the comma within quotes?

I have no control over the CSV file and it doesn;t get sent to me. Customer A will be using the app to read files provided by an external individual.

Upvotes: 13

Views: 7516

Answers (9)

Damovisa
Damovisa

Reputation: 19423

You might have to write your own split function.

  • Iterate through each char in the string
  • When you hit a " character, toggle a boolean
  • When you hit a comma, if the bool is true, ignore it, else, you have your token

Here's an example:

public static class StringExtensions
{
    public static string[] SplitQuoted(this string input, char separator, char quotechar)
    {
        List<string> tokens = new List<string>();

        StringBuilder sb = new StringBuilder();
        bool escaped = false;
        foreach (char c in input)
        {
            if (c.Equals(separator) && !escaped)
            {
                // we have a token
                tokens.Add(sb.ToString().Trim());
                sb.Clear();
            }
            else if (c.Equals(separator) && escaped)
            {
                // ignore but add to string
                sb.Append(c);
            }
            else if (c.Equals(quotechar))
            {
                escaped = !escaped;
                sb.Append(c);
            }
            else
            {
                sb.Append(c);
            }
        }
        tokens.Add(sb.ToString().Trim());

        return tokens.ToArray();
    }
}

Then just call:

string[] tokens = line.SplitQuoted(',','\"');

Benchmarks

Results of benchmarking my code and Dan Tao's code are below. I'm happy to benchmark any other solutions if people want them?

Code:

string input = "\"Barak Obama\", 48, \"President\", \"1600 Penn Ave, Washington DC\""; // Console.ReadLine()
string[] tokens = null;

// run tests
DateTime start = DateTime.Now;
for (int i = 0; i < 1000000; i++)
    tokens = input.SplitWithQualifier(',', '\"', false);
Console.WriteLine("1,000,000 x SplitWithQualifier = {0}ms", DateTime.Now.Subtract(start).TotalMilliseconds);

start = DateTime.Now;
for (int i = 0; i<1000000;i++)
    tokens = input.SplitQuoted(',', '\"');
Console.WriteLine("1,000,000 x SplitQuoted =        {0}ms", DateTime.Now.Subtract(start).TotalMilliseconds);

Output:

1,000,000 x SplitWithQualifier = 8156.25ms
1,000,000 x SplitQuoted =        2406.25ms

Upvotes: 15

Paul Kohler
Paul Kohler

Reputation: 2714

You should be using Microsoft.VisualBasic.FileIO.TextFieldParser for that. It will handle all the CSV stuff correctly for you, see: A similar question with example using the TextFieldParser

PS: Do not fear using the Microsoft.VisualBasic dll in a C# project, it's all .NET :-)

Upvotes: 1

Dan Tao
Dan Tao

Reputation: 128317

I have a SplitWithQualifier extension method that I use here and there, which utilizes Regex.

I make no claim as to the robustness of this code, but it has worked all right for me for a while.

// mangled code horribly to fit without scrolling
public static class CsvSplitter
{
    public static string[] SplitWithQualifier(this string text,
                                              char delimiter,
                                              char qualifier,
                                              bool stripQualifierFromResult)
    {
        string pattern = string.Format(
            @"{0}(?=(?:[^{1}]*{1}[^{1}]*{1})*(?![^{1}]*{1}))",
            Regex.Escape(delimiter.ToString()),
            Regex.Escape(qualifier.ToString())
        );

        string[] split = Regex.Split(text, pattern);

        if (stripQualifierFromResult)
            return split.Select(s => s.Trim().Trim(qualifier)).ToArray();
        else
            return split;
    }
}

Usage:

string csv = "\"Barak Obama\", 48, \"President\", \"1600 Penn Ave, Washington DC\"";
string[] values = csv.SplitWithQualifier(',', '\"', true);

foreach (string value in values)
    Console.WriteLine(value);

Output:

Barak Obama
48
President
1600 Penn Ave, Washington DC

Upvotes: 14

Ishpeck
Ishpeck

Reputation: 2041

I would recommend using a regular expression instead. It will allow you to extract more complicated substrings in a much more versatile manner (precisely as you want).

http://www.c-sharpcorner.com/uploadfile/prasad_1/regexppsd12062005021717am/regexppsd.aspx

http://oreilly.com/windows/archive/csharp-regular-expressions.html

Upvotes: 0

kmontgom
kmontgom

Reputation: 1429

string temp = line.Replace( "\"", "" );

string[] tokens = temp.Split(',')

Upvotes: -2

Amry
Amry

Reputation: 4971

I see from the bigger picture that you are actually trying to parse CSV input. So instead of advising on how to split the string properly, I would instead recommend you to use a CSV parser to do this kind of thing.

A Fast CSV Reader

One that I would recommend is the library (source code available) that you can get from this CodeProject page: http://www.codeproject.com/KB/database/CsvReader.aspx

I personally use it myself and like it. It's a .NET native code and a lot faster than using OLEDB (which also can do the CSV parsing for you, but believe me, it's slow).

Upvotes: 5

eaglestorm
eaglestorm

Reputation: 1202

That would be the expected behavior as quotes are just another string character in C#. Looks like what you are after is the quoted tokens or numeric tokens.

I think you might need to use Regex to split the strings unless some one else knows a better way.

Or you could just loop through the string one character at a time building up the string as you go and build the tokens that way. It's old school but may be the most reliable way in your case.

Upvotes: 0

MusiGenesis
MusiGenesis

Reputation: 75296

You can't parse a CSV line with a simple Split on commas, because some cell contents will contain commas that aren't meant to delineate data but are actually part of the cell contents themselves.

Here is a link to a simple regex-based C# method that will convert your CSV into a handly DataTable:

http://www.hotblue.com/article0000.aspx?a=0006

Working with DataTables is very easy - let me know if you need a code sample for that.

Upvotes: 0

Erup
Erup

Reputation: 532

Can't you change how the CSV is generated? Using OpenOffice, you can set the char separator (use ;) and how the string is delimited (using " or ').

It would be like this: 'President';'1600 Penn Ave, Washington DC'

Upvotes: -1

Related Questions