Troels Larsen

Reputation: 4631

Regex - Escape escape characters

My problem is quite complex, but can be boiled down to a simple example.

I am writing a custom query language where users can input strings, which I parse into LINQ expressions.

What I would like to be able to do is split strings on the * character, unless it is correctly escaped.

Input         Output                          Query Description
"*\\*"    --> { "*", "\\", "*" }       -- contains a '\'
"*\\\**"  --> { "*", "\\\*", "*" }     -- contains '\*'
"*\**"    --> { "*", "\*", "*" }       -- contains '*' (works now)

I don't mind Regex.Split returning empty strings, but I end up with this:

Regex.Split(@"*\\*", @"(?<!\\)(\*)")  --> {"", "*", "\\*"}

As you can see, I have tried with negative lookbehind, which works for all my cases except this one. I have also tried Regex.Escape, but with no luck.

Obviously, my problem is that I am looking for \*, which \\* matches. But in this case, \\ is another escaped sequence.

A solution doesn't necessarily have to involve a Regex.

Upvotes: 18

Views: 4428

Answers (3)

Cruncher

Reputation: 7812

I figured a pure-parsing, non-regex solution would be a good addition to this question.

I could read this significantly faster than I could understand any of those regexes. This also makes fixing unexpected corner-cases easy. The logic is directly laid out.

import java.util.ArrayList;
import java.util.List;

public static String[] splitOnDelimiterWithEscape(String toSplit, char delimiter, char escape) {
    List<String> strings = new ArrayList<>();
    StringBuilder sub = new StringBuilder();

    for (int i = 0; i < toSplit.length(); i++) {
        char c = toSplit.charAt(i);
        if (c == escape) {
            // Append whatever character follows the escape. This makes a single escape
            // character disappear and forces the next character to be literal.
            // An escape at the very end of the input is ignored.
            if (i + 1 < toSplit.length()) {
                sub.append(toSplit.charAt(++i));
            }

            // This is the simplest implementation of the escape. If escaping certain
            // characters should have special behaviour, it should be implemented here.

            // You could even pass a Map from escaped characters to literal characters
            // to make this more general.

        } else if (c == delimiter) {
            strings.add(sub.toString()); // Found a delimiter, so we split.
            sub.setLength(0);
        } else {
            sub.append(c); // Nothing special. Just append to the current string.
        }
    }

    strings.add(sub.toString()); // The end of the string is a boundary. Must include.

    return strings.toArray(new String[0]);
}

UPDATE: I'm actually a little confused about the question now. Splitting, as I've always known it, doesn't include the delimiters (but it looks like your examples do). If you want the delimiters to appear in the array, each in its own slot, then the modification is rather simple. (I'll leave it as an exercise for the reader, as evidence of the code's maintainability.)
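For reference, one possible shape of that modification is sketched below. It is only a sketch under a few assumptions: the class and method names (DelimiterTokenizer, tokenize) are made up for illustration, escape sequences are kept verbatim because the question's expected output still shows the backslashes, and empty pieces are skipped rather than returned.

```java
import java.util.ArrayList;
import java.util.List;

final class DelimiterTokenizer {
    // Hypothetical helper: splits on an unescaped delimiter, emits each delimiter
    // as its own token, and keeps escape sequences verbatim in the output.
    static List<String> tokenize(String input, char delimiter, char escape) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();

        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == escape && i + 1 < input.length()) {
                // Keep the escape and the character it protects together.
                current.append(c).append(input.charAt(++i));
            } else if (c == delimiter) {
                // Flush the pending text (if any), then emit the delimiter itself.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
                tokens.add(String.valueOf(delimiter));
            } else {
                // Ordinary character, or a dangling escape at the end of the input.
                current.append(c);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Java string literals double each backslash: "*\\\\*" is the text *\\*
        System.out.println(tokenize("*\\\\*", '*', '\\')); // prints [*, \\, *]
    }
}
```

If the unescaping behaviour of the original method is wanted instead, appending only the character after the escape (as the code above does) should be the only change needed in the first branch.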

Upvotes: 1

Vyktor

Reputation: 20997

I came up with this regex: (?<=(?:^|[^\\])(?:\\\\)*)(\*)

Explanation:

You just whitelist the situations that can occur immediately before a *. These are:

  • start of the string ^
  • not \ - [^\\]
  • (not \ or beginning of the string) and then even number of \ - (^|[^\\])(\\\\)*
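Taken together, the three cases say that a * is a delimiter exactly when it is preceded by an even number (possibly zero) of consecutive backslashes. A minimal sketch of that rule as a plain loop, written in Java like the non-regex answer; the names EscapeParity and isUnescapedStar are hypothetical and only illustrate the check, they are not part of the regex solution:

```java
final class EscapeParity {
    // Hypothetical helper: true when the '*' at index pos is preceded by an even
    // number (including zero) of consecutive backslashes, i.e. it is unescaped.
    static boolean isUnescapedStar(String s, int pos) {
        if (pos < 0 || pos >= s.length() || s.charAt(pos) != '*') {
            return false;
        }
        int backslashes = 0;
        // Walk left from the star and count the run of backslashes before it.
        for (int i = pos - 1; i >= 0 && s.charAt(i) == '\\'; i--) {
            backslashes++;
        }
        return backslashes % 2 == 0;
    }

    public static void main(String[] args) {
        String text = "*\\\\\\**"; // the text *\\\** (three backslashes)
        System.out.println(isUnescapedStar(text, 4)); // prints false: odd run of backslashes
        System.out.println(isUnescapedStar(text, 5)); // prints true: preceded by '*'
    }
}
```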

Test code and examples:

string[] tests = new string[]{
    @"*\\*",
    @"*\\\**",
    @"*\**",
    @"test\**test2",
};

Regex re = new Regex(@"(?<=(?:^|[^\\])(?:\\\\)*)(\*)");

foreach (string s in tests) {
    string[] m = re.Split( s );
    Console.WriteLine(String.Format("{0,-20} {1}", s, String.Join(", ",
       m.Where(x => !String.IsNullOrEmpty(x)))));
}

Result:

*\\*                 *, \\, *
*\\\**               *, \\\*, *
*\**                 *, \*, *
test\**test2         test\*, *, test2

Upvotes: 4

Jerry

Reputation: 71538

I think it's much easier to match than to split, especially since you are not removing anything from the initial string. So what to match? Everything except an unescaped *.

How to do that? With the regex below:

@"(?:[^*\\]+|\\.)+|\*"

(?:[^*\\]+|\\.)+ matches any run of characters that are neither * nor \, or any escaped character (a \ followed by any one character), repeated as often as possible. No need for any lookaround.

\* will match the separator.

In code:

using System;
using System.Text.RegularExpressions;
using System.Linq;
public class Test
{
    public static void Main()
    {   
        string[] tests = new string[]{
            @"*\\*",
            @"*\\\**",
            @"*\**",
        };

        Regex re = new Regex(@"(?:[^*\\]+|\\.)+|\*");

        foreach (string s in tests) {
            var parts = re.Matches(s)
             .OfType<Match>()
             .Select(m => m.Value)
             .ToList();

            Console.WriteLine(string.Join(", ", parts.ToArray()));
        }
    }
}

Output:

*, \\, *
*, \\\*, *
*, \*, *
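The pattern uses no lookaround, so it should carry over to other regex engines more or less unchanged. As a rough sketch, here is the same idea in Java (the language of the non-regex answer); the class name MatchTokens and the method tokens are hypothetical, and each backslash is doubled again for the Java string literal:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatchTokens {
    // The regex (?:[^*\\]+|\\.)+|\* written as a Java string literal.
    private static final Pattern TOKEN = Pattern.compile("(?:[^*\\\\]+|\\\\.)+|\\*");

    // Hypothetical helper: collects every match, so delimiters and the text
    // between them both end up in the result, in input order.
    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("*\\\\*")); // the text *\\* ; prints [*, \\, *]
    }
}
```

One thing to watch with this approach: a lone backslash at the very end of the input matches neither alternative, so it is silently skipped rather than reported.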


Upvotes: 8
