Reputation: 273
I want to split a string into a list or array.
Input: green,"yellow,green",white,orange,"blue,black"
The split character is the comma (,), but it must ignore commas inside quotes.
The output should be:
Thanks.
Upvotes: 9
Views: 70878
Reputation: 477
Enclosing the regex in '(' and ')' (a capturing group) and then splitting on that regex should solve this, e.g. /("[^"]+")/g
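For illustration, here is a rough sketch of what splitting on a capturing group might look like in C#. In .NET, Regex.Split keeps the captured text in the returned array, so the quoted pieces survive, but the unquoted pieces in between still carry their commas and need a second, plain split. The class and method names are made up for this example, and the sketch silently drops empty fields, which may or may not be acceptable.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CaptureGroupSplit
{
    // Hypothetical helper, not part of the answer above.
    public static List<string> Split(string input)
    {
        var fields = new List<string>();
        // The parentheses make Regex.Split keep each quoted run in the result.
        foreach (string piece in Regex.Split(input, "(\"[^\"]+\")"))
        {
            if (piece.StartsWith("\"", StringComparison.Ordinal))
            {
                // A captured quoted field: keep it whole, quotes included.
                fields.Add(piece);
            }
            else
            {
                // Text between quoted fields, e.g. ",white,orange,":
                // split on commas and discard the empty leftovers.
                fields.AddRange(piece.Split(new[] { ',' },
                    StringSplitOptions.RemoveEmptyEntries));
            }
        }
        return fields;
    }
}
```

So the capturing group alone is not quite enough; it only protects the quoted fields from the second split.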
Upvotes: -1
Reputation: 18980
using System;
using System.Text.RegularExpressions;
public class Program
{
public static void Main()
{
string input = @"green,""yellow,green"",white,orange,""blue,black""";
string splitOn = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";
string[] words = Regex.Split(input, splitOn);
foreach (var word in words)
{
Console.WriteLine(word);
}
}
}
OUTPUT:
green
"yellow,green"
white
orange
"blue,black"
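The pattern works because the lookahead only accepts a comma when the rest of the string contains an even number of quote characters, which means the comma itself is outside any quoted run. As the output shows, the quotes stay attached to the fields. If they are not wanted, something like the following sketch could strip them; the class and method names are invented for the example, and it assumes the quotes are balanced.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LookaheadSplit
{
    // Same pattern as above: a comma followed by an even number of quotes
    // up to the end of the string.
    private static readonly Regex Separator =
        new Regex(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");

    // Hypothetical helper: split, then remove the enclosing quotes.
    public static string[] SplitAndUnquote(string input)
    {
        return Separator.Split(input)
            .Select(field => field.Trim('"'))
            .ToArray();
    }
}
```

Note that the lookahead rescans the remainder of the string for every comma, so this is probably not the fastest option for long lines.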
Upvotes: 1
Reputation: 26920
Actually, this is easy enough to do with just Match:
string subjectString = @"green,""yellow,green"",white,orange,""blue,black""";
try
{
Regex regexObj = new Regex(@"(?<="")\b[a-z,]+\b(?="")|[a-z]+", RegexOptions.IgnoreCase);
Match matchResults = regexObj.Match(subjectString);
while (matchResults.Success)
{
Console.WriteLine("{0}", matchResults.Value);
// matched text: matchResults.Value
// match start: matchResults.Index
// match length: matchResults.Length
matchResults = matchResults.NextMatch();
}
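}
catch (ArgumentException ex)
{
// Syntax error in the regular expression
Console.WriteLine(ex.Message);
}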
Output :
green
yellow,green
white
orange
blue,black
Explanation :
@"
# Match either the regular expression below (attempting the next alternative only if this one fails)
(?<= # Assert that the regex below can be matched, with the match ending at this position (positive lookbehind)
"" # Match the character “""” literally
)
\b # Assert position at a word boundary
[a-z,] # Match a single character present in the list below
# A character in the range between “a” and “z”
# The character “,”
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
\b # Assert position at a word boundary
(?= # Assert that the regex below can be matched, starting at this position (positive lookahead)
"" # Match the character “""” literally
)
| # Or match regular expression number 2 below (the entire match attempt fails if this one fails to match)
[a-z] # Match a single character in the range between “a” and “z”
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
"
Upvotes: 14
Reputation: 25310
What you have there is an irregular language. In other words, the meaning of a character depends upon the sequence of characters before or after it. As the name implies, regular expressions are for parsing regular languages.
What you need here is a Tokenizer and Parser, a good internet search engine should guide you to examples. In fact as the tokens are just characters you probably don't even need the Tokenizer.
While you can do this simple case using a regular expression, it is likely to be very slow. It could also cause issues if the quotes are ever unbalanced, as a regular expression would not detect this error, whereas a parser would.
If you are importing a CSV file you may want to have a look at the Microsoft.VisualBasic.FileIO.TextFieldParser class (Simply add a reference to Microsoft.VisualBasic.dll in a C# project) which parses CSV files.
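As a rough sketch of the TextFieldParser route, the snippet below parses a single line held in a string by wrapping it in a StringReader. The wrapper class and method name are invented for the example; it assumes the project references Microsoft.VisualBasic.

```csharp
using System;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class TextFieldParserExample
{
    // Hypothetical helper: parse one comma-delimited line into its fields.
    public static string[] ParseLine(string line)
    {
        using (var parser = new TextFieldParser(new StringReader(line)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            // Commas inside quoted fields are treated as data, not delimiters.
            parser.HasFieldsEnclosedInQuotes = true;
            // ReadFields returns null when there is nothing left to read.
            return parser.ReadFields() ?? new string[0];
        }
    }
}
```

For a file on disk, the same settings can be applied to a parser constructed from the file path, calling ReadFields in a loop until EndOfData is true.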
Another way to do this is to write your own state machine (example below) though this still does not solve the issue of a quote in the middle of a value:
using System;
using System.Text;
namespace Example
{
class Program
{
static void Main(string[] args)
{
string subjectString = @"green,""yellow,green"",white,orange,""blue,black""";
bool inQuote = false;
StringBuilder currentResult = new StringBuilder();
foreach (char c in subjectString)
{
switch (c)
{
case '\"':
inQuote = !inQuote;
break;
case ',':
if (inQuote)
{
currentResult.Append(c);
}
else
{
Console.WriteLine(currentResult);
currentResult.Clear();
}
break;
default:
currentResult.Append(c);
break;
}
}
if (inQuote)
{
throw new FormatException("Input string does not have balanced Quote Characters");
}
Console.WriteLine(currentResult);
}
}
}
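One possible way to deal with a quote in the middle of a value is to adopt the common CSV convention that a doubled quote ("") inside a quoted field stands for one literal quote. That convention is an assumption here, not something the question specifies. A sketch of the same state machine extended along those lines, with invented names and returning a list instead of printing:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class QuoteAwareSplitter
{
    // Hypothetical helper; assumes "" inside quotes means a literal quote.
    public static List<string> Split(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuote = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (c == '"')
            {
                if (inQuote && i + 1 < line.Length && line[i + 1] == '"')
                {
                    // Doubled quote inside a quoted field: emit one quote
                    // and skip the second character of the pair.
                    current.Append('"');
                    i++;
                }
                else
                {
                    inQuote = !inQuote;
                }
            }
            else if (c == ',' && !inQuote)
            {
                // Unquoted comma ends the current field.
                fields.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }

        if (inQuote)
        {
            throw new FormatException("Input string does not have balanced quote characters");
        }
        fields.Add(current.ToString());
        return fields;
    }
}
```

With this version, the input a,"b ""c"", d",e yields three fields: a, then b "c", d, then e.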
Upvotes: 5
Reputation: 42095
Someone will shortly come up with an answer that does this with a single regex. I'm not that clever, but just for the sake of balance, here's a suggestion that doesn't rely entirely on a regex, based on the old adage that when you try to solve a problem with a regex, you then have two problems. :)
Personally given my lack of regex-fu, I'd do one of the following:
Replace to escape any commas inside quotes with something else (i.e. ","). Then you can do a simple string.Split() on the result and unescape each item in the resulting array before you use it. This is yucky. Partly because it's double-handling everything, and partly because it also uses regexes. Boooo!
There's a good chance non-regex options would perform better if well-written, because regexes can be a little expensive as they scan strings internally looking for patterns.
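To make the escape, split, unescape idea concrete, here is a minimal sketch. The placeholder character and the class and method names are purely hypothetical choices for this example, and it assumes the placeholder never appears in real input and that quotes are balanced.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class EscapeAndSplit
{
    // Hypothetical placeholder; assumed never to occur in the data.
    private const string Placeholder = "\u0001";

    public static string[] Split(string input)
    {
        // Step 1: inside each quoted run, swap commas for the placeholder.
        string escaped = Regex.Replace(
            input,
            "\"[^\"]*\"",
            match => match.Value.Replace(",", Placeholder));

        // Step 2: a plain split is now safe.
        // Step 3: restore the commas and drop the enclosing quotes.
        return escaped.Split(',')
            .Select(field => field.Replace(Placeholder, ",").Trim('"'))
            .ToArray();
    }
}
```

Every field is touched twice, once to escape and once to unescape, which is the double-handling mentioned above.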
Really, I just wanted to point out that you don't have to use a regex. :)
Here's a fairly naive implementation of my second suggestion. On my PC it's happy parsing 1 million 15-column strings in a little over 4.5 seconds.
public class ManualParser : IParser
{
public IEnumerable<string> Parse(string line)
{
if (string.IsNullOrWhiteSpace(line)) return new List<string>();
line = line.Trim();
if (line.Contains(",") == false) return new[] { line.Trim('"') };
if (line.Contains("\"") == false) return line.Split(',').Select(c => c.Trim());
bool withinQuotes = false;
var builder = new List<string>();
var trimChars = new[] { ' ', '"' };
int left = 0;
int right = 0;
for (right = 0; right < line.Length; right++)
{
char c = line[right];
if (c == '"')
{
withinQuotes = !withinQuotes;
continue;
}
if (c == ',' && !withinQuotes)
{
builder.Add(line.Substring(left, right - left).Trim(trimChars));
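// The next field starts just past the comma. Only 'left' moves here:
// also advancing 'right' would make the loop skip the character after
// the comma, missing an opening quote such as in green,"yellow,green".
left = right + 1;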
}
}
builder.Add(line.Substring(left, right - left).Trim(trimChars));
return builder;
}
}
Here are some unit tests for it:
[TestFixture]
public class ManualParserTests
{
[Test]
public void Parse_GivenStringWithNoQuotesAndNoCommas_ShouldReturnThatString()
{
// Arrange
var parser = new ManualParser();
// Act
string[] result = parser.Parse("This is my data").ToArray();
// Assert
Assert.AreEqual(1, result.Length, "Should only be one column returned");
Assert.AreEqual("This is my data", result[0], "Incorrect value is returned");
}
[Test]
public void Parse_GivenStringWithNoQuotesAndOneComma_ShouldReturnTwoColumns()
{
// Arrange
var parser = new ManualParser();
// Act
string[] result = parser.Parse("This is, my data").ToArray();
// Assert
Assert.AreEqual(2, result.Length, "Should be 2 columns returned");
Assert.AreEqual("This is", result[0], "First value is incorrect");
Assert.AreEqual("my data", result[1], "Second value is incorrect");
}
[Test]
public void Parse_GivenStringWithQuotesAndNoCommas_ShouldReturnColumnWithoutQuotes()
{
// Arrange
var parser = new ManualParser();
// Act
string[] result = parser.Parse("\"This is my data\"").ToArray();
// Assert
Assert.AreEqual(1, result.Length, "Should be 1 column returned");
Assert.AreEqual("This is my data", result[0], "Value is incorrect");
}
[Test]
public void Parse_GivenStringWithQuotesAndCommas_ShouldReturnColumnsWithoutQuotes()
{
// Arrange
var parser = new ManualParser();
// Act
string[] result = parser.Parse("\"This is\", my data").ToArray();
// Assert
Assert.AreEqual(2, result.Length, "Should be 2 columns returned");
Assert.AreEqual("This is", result[0], "First value is incorrect");
Assert.AreEqual("my data", result[1], "Second value is incorrect");
}
[Test]
public void Parse_GivenStringWithQuotesContainingCommasAndCommas_ShouldReturnColumnsWithoutQuotes()
{
// Arrange
var parser = new ManualParser();
// Act
string[] result = parser.Parse("\"This, is\", my data").ToArray();
// Assert
Assert.AreEqual(2, result.Length, "Should be 2 columns returned");
Assert.AreEqual("This, is", result[0], "First value is incorrect");
Assert.AreEqual("my data", result[1], "Second value is incorrect");
}
}
And here's a sample app that I tested the throughput with:
class Program
{
static void Main(string[] args)
{
RunTest();
}
private static void RunTest()
{
var parser = new ManualParser();
string csv = Properties.Resources.Csv;
var result = new StringBuilder();
var s = new Stopwatch();
for (int test = 0; test < 3; test++)
{
int lineCount = 0;
s.Start();
for (int i = 0; i < 1000000 / 50; i++)
{
foreach (var line in csv.Split(new[] { Environment.NewLine }, StringSplitOptions.None))
{
string cur = line + s.ElapsedTicks.ToString();
result.AppendLine(parser.Parse(cur).ToString());
lineCount++;
}
}
s.Stop();
Console.WriteLine("Completed {0} lines in {1}ms", lineCount, s.ElapsedMilliseconds);
s.Reset();
result = new StringBuilder();
}
}
}
Upvotes: 3
Reputation: 14909
The format of the string you are trying to split appears to be standard CSV. Using a CSV parser would likely be easier/faster.
Upvotes: 3