Reputation: 1035
I have a text field that accepts user input in the form of delimited lists of strings. I have two main delimiters, a space and a comma.
If an item in the list contains more than one word, a user can delineate it by enclosing it in quotes.
Sample Input:
Apple, Banana Cat, "Dog starts with a D" Elephant Fox "G is tough", "House"
Desired Output:
Apple
Banana
Cat
Dog starts with a D
Elephant
Fox
G is a tough one
House
I've been working on getting a regex for this, and I can't figure out how to allow the commas. Here is what I have so far:
Regex.Matches(input, @"(?<match>\w+)|\""(?<match>[\w\s]*)""")
    .Cast<Match>()
    .Select(m => m.Groups["match"].Value.Replace("\"", ""))
    .Where(x => x != "")
    .Distinct()
    .ToList()
Upvotes: 1
Views: 2069
Reputation: 3257
I like paxdiablo's parser, but if you want to use a single regex, then consider my modified version of a CSV regex parser.
Step 1: the original
string regex = "((?<field>[^\",\\r\\n]+)|\"(?<field>([^\"]|\"\")+)\")(,|(?<rowbreak>\\r\\n|\\n|$))";
Step 2: using multiple delimiters
char quoter = '"'; // quotation mark
string delimiter = " ,"; // either space or comma
string regex = string.Format("((?<field>[^\\r\\n{1}{0}]*)|[{1}](?<field>([^{1}]|[{1}][{1}])*)[{1}])([{0}]|(?<rowbreak>\\r\\n|\\n|$))", delimiter, quoter);
Using a simple loop to test:
Regex re = new Regex(regex);
foreach (Match m in re.Matches(input))
{
    // Un-double any escaped quotes and trim stray whitespace.
    string field = m.Result("${field}").Replace("\"\"", "\"").Trim();
    // string rowbreak = m.Result("${rowbreak}");
    if (field != string.Empty)
    {
        // Print(field);
    }
}
We get the output:
Apple
Banana
Cat
Dog starts with a D
Elephant
Fox
G is tough
House
That's it!
Look at the original CSV regex parser for ideas on handling the matched regex data. You might have to modify it slightly, but you'll get the idea.
Just for interest's sake, if you are crazy enough to want to use multiple characters as a single delimiter, then consider this answer.
Upvotes: 0
Reputation: 61
You could use two regexes. The first matches the quoted sections, which you then remove from the input. The second matches the remaining words.
string pat = "\"(.*?)\"", pat2 = "(\\w+)";
string x = "Apple, Banana Cat, \"Dog starts with a D\" Elephant Fox \"G is tough\", \"House\"";
IEnumerable<Match> combined = Regex.Matches(Regex.Replace(x, pat, ""), pat2)
    .OfType<Match>()
    .Union(Regex.Matches(x, pat).OfType<Match>())
    .Where(m => m.Success);
foreach (Match m in combined)
    Console.WriteLine(m.Groups[1].ToString());
Let me know if this isn't what you were looking for.
Upvotes: 0
Reputation: 881463
That regex is pretty smart if it can turn "G is tough" into G is a tough one :-)
On a more serious note, code up a parser and don't try to rely on a single regex to do this for you.
You'll find you learn more, the code will be more readable, and you won't have to concern yourself with edge cases that you haven't even figured out yet, like:
Apple, Banana Cat, "Dog, not elephant, starts with a D" Elephant Fox
A simple parser for that situation would be:
state = whitespace
word = ""
for each character in (string + " "):
    if state is whitespace:
        if character is not whitespace:
            word = character
            state = inword
    else:
        if character is whitespace:
            process word
            word = ""
            state = whitespace
        else:
            word = word + character
and it's relatively easy to add support for quoting:
state = whitespace
quote = no
word = ""
for each character in (string + " "):
    if state is whitespace:
        if character is not whitespace:
            state = inword
            if character is quote:
                quote = yes
            else:
                word = character
    else:
        if character is whitespace and quote is no:
            process word
            word = ""
            state = whitespace
        else:
            if character is quote:
                quote = not quote
            else:
                word = word + character
Note that I haven't tested these thoroughly, but I've done this quite a bit in the past, so I'm quietly confident. It's only a short step from there to one that can also allow escaping (for example, if you want quotes within quotes like "The \" character is inside").
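For reference, here is one way the quoting parser could look in C#. It is a minimal sketch rather than a tested implementation, and it makes assumptions that go beyond the pseudocode: commas are treated as separators alongside whitespace (to match the question), a backslash makes the next character literal, and the last word is flushed after the loop instead of by appending a space. The names ListParser and Tokenize are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class ListParser
{
    // Splits on whitespace or commas, except inside double quotes.
    // Assumption: a backslash makes the next character literal,
    // so \" puts a quote character into the current word.
    public static List<string> Tokenize(string input)
    {
        var words = new List<string>();
        var word = new StringBuilder();
        bool inWord = false;   // "state" in the pseudocode
        bool inQuote = false;  // "quote" in the pseudocode
        bool escaped = false;  // previous character was a backslash

        foreach (char c in input)
        {
            if (escaped)
            {
                // Take the character literally, whatever it is.
                word.Append(c);
                escaped = false;
            }
            else if (c == '\\')
            {
                escaped = true;
                inWord = true;
            }
            else if (c == '"')
            {
                // A quote toggles quoting and never becomes part of the word.
                inQuote = !inQuote;
                inWord = true;
            }
            else if (!inQuote && (char.IsWhiteSpace(c) || c == ','))
            {
                // Separator outside quotes: finish the current word, if any.
                if (inWord)
                {
                    words.Add(word.ToString());
                    word.Clear();
                    inWord = false;
                }
            }
            else
            {
                word.Append(c);
                inWord = true;
            }
        }

        // Flush the last word (also covers an unterminated quote).
        if (inWord)
        {
            words.Add(word.ToString());
        }
        return words;
    }
}
```

Calling ListParser.Tokenize on the sample input from the question should give the eight items in order, and the edge case above should keep Dog, not elephant, starts with a D together as one item.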
Getting a single regex to handle multiple separators isn't that hard; getting it to monitor state, such as when you're within quotes, so that it can treat separators differently, is another level.
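To illustrate the easy half of that, a stateless split on either separator takes one line with Regex.Split. This is only a sketch, and the names NaiveSplit and Split are hypothetical. Because the pattern knows nothing about quotes, quoted items get broken apart at their inner spaces and the quote characters stay attached.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class NaiveSplit
{
    // Splits on any run of whitespace and/or commas.
    // No state is tracked, so separators inside quotes are not protected.
    public static string[] Split(string input)
    {
        return Regex.Split(input, @"[\s,]+")
                    .Where(s => s.Length > 0)   // drop empty leading/trailing pieces
                    .ToArray();
    }
}
```

With Apple, Banana Cat this gives three clean items, but "Dog starts" Fox comes back as "Dog, starts" and Fox, which is the problem the state machine solves.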
Upvotes: 2
Reputation: 3905
You should choose between using spaces or commas as delimiters. Using both is a bit confusing. If that choice is not yours to make, I would grab the things between quotes first. When they are gone, you can just replace all commas with spaces and split the line on spaces.
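A rough sketch of that two-pass idea in C# follows. The names TwoPassSplitter and Split are hypothetical, and the sketch assumes quotes are always balanced and never escaped. Note that the quoted items come out first, so the original order of the list is not preserved.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TwoPassSplitter
{
    public static List<string> Split(string input)
    {
        // Assumption: quotes are balanced and contain no escaped quotes.
        var quotedPattern = new Regex("\"([^\"]*)\"");

        // Pass 1: collect everything between quotes.
        var items = quotedPattern.Matches(input)
            .Cast<Match>()
            .Select(m => m.Groups[1].Value)
            .ToList();

        // Pass 2: remove the quoted sections, turn commas into spaces,
        // then split what is left on spaces.
        string rest = quotedPattern.Replace(input, " ").Replace(',', ' ');
        items.AddRange(rest.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries));

        return items;
    }
}
```

If the order matters, a single pass over the characters is probably the simpler route.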
Upvotes: 0