Hiren Amin

Reputation: 77

Regular expression for pipe delimited and double quoted string

I have a string something like this:

"2014-01-23 09:13:45|\"10002112|TR0859657|25-DEC-2013>0000000000000001\"|10002112"

I would like to split on the pipe character, except where it appears inside double quotes (similar to how CSV handles quoted fields), so that I end up with something like:

[0] => 2014-01-23 09:13:45
[1] => 10002112|TR0859657|25-DEC-2013>0000000000000001
[2] => 10002112

Is there a regular expression that can do this?

Upvotes: 0

Views: 1556

Answers (3)

PiotrWolkowski

Reputation: 8782

I think you may need to write your own parser.

You will need:

  • a collection to hold the results

  • a boolean flag that tracks whether the current position is inside or outside quotation marks

  • a string (or StringBuilder) to hold the current word

The idea is to read the string one character at a time and append each character to the current word. When you hit a pipe outside quotation marks, add the word to the result collection and start a new word. When you hit a quote, flip the flag so that subsequent pipes are appended to the word instead of being treated as delimiters; the next quote flips the flag back, so the following pipe ends the word (with the quoted pipes kept inside it). After the loop, add the last word. The quote characters themselves are dropped. I tested the code below on your example and it worked.

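    // Requires: using System.Collections.Generic; using System.Text;
    private static List<string> ParseLine(string yourString)
    {
        // True while we are inside a quoted section, where pipes are literal.
        bool ignorePipe = false;
        var word = new StringBuilder();

        List<string> divided = new List<string>();
        foreach (char c in yourString)
        {
            if (c == '|' && !ignorePipe)
            {
                // Unquoted pipe: the current word is complete.
                divided.Add(word.ToString());
                word.Clear();
            }
            else if (c == '"')
            {
                // Toggle quoted mode; the quote itself is not kept.
                ignorePipe = !ignorePipe;
            }
            else
            {
                word.Append(c);
            }
        }

        // Add the final word, which has no trailing pipe.
        divided.Add(word.ToString());

        return divided;
    }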

Upvotes: 2

gunr2171

Reputation: 17510

I'm going to blatantly ignore the fact that you want a RegEx, because I think that writing your own method that returns an IEnumerable will be easier. Plus, you get instant access to LINQ.

var line = "2014-01-23 09:13:45|\"10002112|TR0859657|25-DEC-2013>0000000000000001\"|10002112";

var data = GetPartsFromLine(line).ToList();


private static IEnumerable<string> GetPartsFromLine(string line)
{
    int position = 0;

    while (position <= line.Length)
    {
        if (position < line.Length && line[position] == '"')
        {
            // Quoted field: go find the next "
            int endQuote = line.IndexOf('"', position + 1);

            if (endQuote == -1)
            {
                // Unterminated quote: treat the rest of the line as the field
                endQuote = line.Length;
            }

            yield return line.Substring(position + 1, endQuote - position - 1);

            // Move past the closing quote
            position = endQuote + 1;

            if (position >= line.Length)
            {
                // The quoted field was the last one
                yield break;
            }

            if (line[position] == '|')
            {
                // Skip the pipe that follows the closing quote
                position++;
            }
        }
        else
        {
            // Unquoted field: go find the next |
            int pipe = line.IndexOf('|', position);

            if (pipe == -1)
            {
                // Hit the end of the line
                yield return line.Substring(position);
                yield break;
            }

            yield return line.Substring(position, pipe - position);
            position = pipe + 1;
        }
    }
}

This hasn't been fully tested, but it works with your example.

Upvotes: 0

Dalorzo

Reputation: 20014

How about this Regular Expression:

/((["|]).*\2)/g

Online Demo

It looks like it could be used as a valid split expression.
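For C#, a regex-based split might look something like the sketch below. Note that it uses a different pattern from the one above: a lookahead that only accepts a pipe when an even number of double quotes follows it, meaning the pipe sits outside any quoted section. The class name `PipeSplitter` and the method name `SplitLine` are made up for illustration, and the sketch assumes quotes are balanced and never escaped inside a field.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class PipeSplitter
{
    // Matches a pipe only when an even number of double quotes follows it,
    // i.e. when the pipe is not inside a quoted section.
    private static readonly Regex UnquotedPipe =
        new Regex(@"\|(?=(?:[^""]*""[^""]*"")*[^""]*$)");

    // Hypothetical helper; assumes balanced, unescaped quotes.
    public static string[] SplitLine(string line)
    {
        return UnquotedPipe.Split(line)
            .Select(part => part.Trim('"')) // drop the surrounding quotes
            .ToArray();
    }
}
```

Calling `PipeSplitter.SplitLine(line)` on the string from the question should give the three fields listed there. Empty fields are preserved, but the lookahead rescans the rest of the line for every pipe, so for very long lines a character-by-character parser is likely to be faster.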

Upvotes: 0
