Reputation: 44550
How can I split comma separated strings with quoted strings that can also contain commas?
Example input:
John, Doe, "Sid, Nency", Smith
Expected output:
Split by commas was ok, but I've got requirement that strings like "Sid, Nency" are allowed. I tried to use regexes to split such values. Regex ",(?=([^\"]*\"[^\"]*\")*[^\"]*$)"
is from Java question and it is not working good for my .NET code. It doubles some strings, finds extra results etc.
So what is the best way to split such strings?
Upvotes: 2
Views: 2874
Reputation: 2709
If you find the regular expression too complex you can do it like this:
string initialString = "John, Doe, \"Sid, Nency\", Smith";
IEnumerable<string> splitted = initialString.Split('"');
splitted = splitted.SelectMany((str, index) => index % 2 == 0 ? str.Split(',') : new[] { str });
splitted = splitted.Where(str => !string.IsNullOrWhiteSpace(str)).Select(str => str.Trim());
Upvotes: 0
Reputation: 39437
Just go through your string. As you go through your string keep track
if you're in a "block" or not. If you're - don't treat the comma as
a comma (as a separator). Otherwise do treat it as such. It's a simple
algorithm, I would write it myself. When you encounter first " you enter
a block. When you encounter next ", you end that block you were, and so on.
So you can do it with one pass through your string.
import java.util.ArrayList;
public class Test003 {
public static void main(String[] args) {
String s = " John, , , , \" Barry, John \" , , , , , Doe, \"Sid , Nency\", Smith ";
StringBuilder term = new StringBuilder();
boolean inQuote = false;
boolean inTerm = false;
ArrayList<String> terms = new ArrayList<String>();
for (int i=0; i<s.length(); i++){
char ch = s.charAt(i);
if (ch == ' '){
if (inQuote){
if (!inTerm) {
inTerm = true;
}
term.append(ch);
}
else {
if (inTerm){
terms.add(term.toString());
term.setLength(0);
inTerm = false;
}
}
}else if (ch== '"'){
term.append(ch); // comment this out if you don't need it
if (!inTerm){
inTerm = true;
}
inQuote = !inQuote;
}else if (ch == ','){
if (inQuote){
if (!inTerm){
inTerm = true;
}
term.append(ch);
}else{
if (inTerm){
terms.add(term.toString());
term.setLength(0);
inTerm = false;
}
}
}else{
if (!inTerm){
inTerm = true;
}
term.append(ch);
}
}
if (inTerm){
terms.add(term.toString());
}
for (String t : terms){
System.out.println("|" + t + "|");
}
}
}
Upvotes: 1
Reputation: 71538
It's because of the capture group. Just turn it into a non-capture group:
",(?=(?:[^""]*""[^""]*"")*[^""]*$)"
^^
The capture group is including the captured part in your results.
var regexObj = new Regex(@",(?=(?:[^""]*""[^""]*"")*[^""]*$)");
regexObj.Split(input).Select(s => s.Trim('\"', ' ')).ForEach(Console.WriteLine);
And just trim the results.
Upvotes: 4
Reputation: 38825
I use the following code within my Csv Parser class to achieve this:
private string[] ParseLine(string line)
{
List<string> results = new List<string>();
bool inQuotes = false;
int index = 0;
StringBuilder currentValue = new StringBuilder(line.Length);
while (index < line.Length)
{
char c = line[index];
switch (c)
{
case '\"':
{
inQuotes = !inQuotes;
break;
}
default:
{
if (c == ',' && !inQuotes)
{
results.Add(currentValue.ToString());
currentValue.Clear();
}
else
currentValue.Append(c);
break;
}
}
++index;
}
results.Add(currentValue.ToString());
return results.ToArray();
} // eo ParseLine
Upvotes: 0