ryber

Reputation: 4555

Regex battle between maximum and minimum munge

Greetings, I have a file with the following string:

string.Format("{0},{1}", "Having \"Two\" On The Same Line".Localize(), "Is Tricky For regex".Localize());

My goal is to get a match set with the two strings:

Having \"Two\" On The Same Line
Is Tricky For regex

My current regex looks like this:

private Regex CSharpShortRegex = new Regex("\"(?<constant>[^\"]+?)\".Localize\\(\\)");

My problem is with the escaped quotes in the first string: the match stops at the escaped quote, and I get:

However, attempting to ignore the escaped quotes is not working out, because it makes the regex greedy and I get:

We seem to be caught between maximum and minimum munge. Is there any hope? I have some backup plans. Can you run a regex backwards? That would make it easier, because I could start with the "()ezilacoL."

EDIT: To clarify, this is my lone edge case. Most of the time the string sits alone, like:

var myString = "Hot Patootie".Localize()

Upvotes: 1

Views: 449

Answers (5)

vadim

Reputation: 1

new Regex(@"((([^@]|^|\n)""(?<constant>((\\.)|[^""])*)"")|(@""(?<constant>(""""|[^""])*)""))\s*\.\s*Localize\s*\(\s*\)", RegexOptions.Compiled);

This takes care of both regular and verbatim (@"") strings. It also takes escape sequences into account.
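As a rough check of the pattern above, here is a minimal sketch that runs it against one regular literal and one verbatim literal. The two sample lines and the class name VerbatimDemo are made up for illustration; they are not from the question.

```csharp
using System;
using System.Text.RegularExpressions;

class VerbatimDemo
{
    static void Main()
    {
        // Same pattern as above: the first branch handles "..." literals
        // (backslash escapes), the second handles @"..." literals (doubled quotes).
        var regex = new Regex(
            @"((([^@]|^|\n)""(?<constant>((\\.)|[^""])*)"")|(@""(?<constant>(""""|[^""])*)""))\s*\.\s*Localize\s*\(\s*\)",
            RegexOptions.Compiled);

        // Hypothetical sample lines, written as verbatim strings so they
        // hold the source text exactly as it would appear in a .cs file.
        string regular = @"var a = ""He said \""hi\"""".Localize();";
        string verbatim = @"var b = @""Say """"hi"""""".Localize();";

        Console.WriteLine(regex.Match(regular).Groups["constant"].Value);  // He said \"hi\"
        Console.WriteLine(regex.Match(verbatim).Groups["constant"].Value); // Say ""hi""
    }
}
```

Both branches write to the same named group, which .NET allows, so the caller reads Groups["constant"] regardless of which literal style matched.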

Upvotes: 0

Greg Bacon

Reputation: 139611

Update:

My original answer (below the horizontal rule) has a bug: regular-expression matchers attempt alternatives in left-to-right order. Having [^"] as the first alternative allows it to consume the backslash, but then the next character to be matched is a quote, which prevents the match from proceeding.

Incompatibility note: Given the pattern below, Perl backtracks to the other alternative (the escaped quote) and successfully finds a match for the Having \"Two\" On The Same Line case.

The fix is to try an escaped quote first and then a non-quote:

var CSharpShortRegex =
  new Regex("\"(?<constant>(\\\\\"|[^\"])*)\"\\.Localize\\(\\)");

or if you prefer the at-string form:

var CSharpShortRegex =
  new Regex(@"""(?<constant>(\\""|[^""])*)""\.Localize\(\)");

---

Allow for escapes:

private Regex CSharpShortRegex =
  new Regex("\"(?<constant>([^\"]|\\\\\")*)\"\\.Localize\\(\\)");

Applying one level of escaping to make the pattern easier to read, we get

"(?<constant>([^"]|\\")*)"\.Localize\(\)

That is, a string starts and ends with " characters, and everything between is either a non-quote or an escaped quote.
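To see the corrected pattern from the update in action, here is a small sketch that applies it to the exact line from the question. The class name LocalizeDemo is just a placeholder.

```csharp
using System;
using System.Text.RegularExpressions;

class LocalizeDemo
{
    static void Main()
    {
        // The line from the question, as a verbatim string so the quotes and
        // backslashes are exactly what the regex would see in the source file.
        string line = @"string.Format(""{0},{1}"", ""Having \""Two\"" On The Same Line"".Localize(), ""Is Tricky For regex"".Localize());";

        // Escaped quote first, then any non-quote character.
        var regex = new Regex(@"""(?<constant>(\\""|[^""])*)""\.Localize\(\)");

        foreach (Match m in regex.Matches(line))
            Console.WriteLine(m.Groups["constant"].Value);

        // Prints:
        // Having \"Two\" On The Same Line
        // Is Tricky For regex
    }
}
```

The "{0},{1}" literal is skipped because the closing quote must be followed directly by .Localize().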

Upvotes: 1

Welbog

Reputation: 60438

This one works for me:

\"((?:[^\\"]|(?:\\\"))*)\"\.Localize\(\)

Tested on http://www.regexplanet.com/simple/index.html against a number of strings with various escaped quotes.

Looks like most of us who answered this one had the same rough idea, so let me explain the approach (comments after #s):

\"             # We're looking for a string delimited by quotation marks
(              # Capture the contents of the quotation marks
  (?:          #   Start a non-capturing group
    [^\\"]     #     Either read a character that isn't a quote or a backslash
    |(?:\\\")  #     Or read in a backslash followed by a quote.
  )*           #   Keep reading
)              # End the capturing group
\"             # The string literal ends in a quotation mark
\.Localize\(\) # and ends with the literal '.Localize()', escaping ., ( and )

For C# you'll need to escape the backslashes twice (messy):

\"((?:[^\\\\\"]|(?:\\\\\"))*)\"\\.Localize\\(\\)

Mark correctly points out that this one doesn't match escaped characters other than quotation marks. So here's a better version:

\"((?:[^\\"]|(?:\\")|(?:\\.))*)\"\.Localize\(\)

And its slashed-up equivalent:

\"((?:[^\\\\\"]|(?:\\\\\")|(?:\\\\.))*)\"\\.Localize\\(\\)

Works the same way, except it has a special case: if it encounters a backslash but can't match \", it just consumes the backslash and the following character and moves on.

Thinking about it, it's better to just consume two characters at every backslash, which is effectively Mark's answer so I won't repeat it.
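To make the difference concrete, here is a minimal sketch comparing the quote-only version with the "consume two characters at every backslash" version on a string that contains \t. The sample line and the names EscapeDemo, quoteOnly and anyEscape are made up for the illustration.

```csharp
using System;
using System.Text.RegularExpressions;

class EscapeDemo
{
    static void Main()
    {
        // Hypothetical source text containing an escape other than \" (here \t).
        string source = @"var s = ""Tab\there"".Localize();";

        // First version: a backslash is only allowed when followed by a quote.
        var quoteOnly = new Regex(@"""((?:[^\\""]|(?:\\""))*)""\.Localize\(\)");

        // Simplified version: a backslash always consumes the next character too.
        var anyEscape = new Regex(@"""((?:[^\\""]|\\.)*)""\.Localize\(\)");

        Console.WriteLine(quoteOnly.IsMatch(source));               // False
        Console.WriteLine(anyEscape.Match(source).Groups[1].Value); // Tab\there
    }
}
```

The first pattern gets stuck at the backslash because \t is neither a plain character in its class nor an escaped quote, so it never reaches the closing quote.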

Upvotes: 1

Josef Pfleger

Reputation: 74527

Looks like you're trying to parse code, so one approach might be to evaluate the code on the fly:

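// Requires: using Microsoft.CSharp; using System.CodeDom.Compiler;
// `input` holds the source text of a C# string expression, quotes included.
var cr = new CSharpCodeProvider().CompileAssemblyFromSource(
    new CompilerParameters { GenerateInMemory = true },
    "class x { public static string e() { return " + input + "; } }");

// Invoke the generated static method to get the evaluated string.
var result = cr.CompiledAssembly.GetType("x")
    .GetMethod("e").Invoke(null, null) as string;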

This way you could handle all kinds of other special cases (e.g. concatenated or verbatim strings) that would be extremely difficult to handle with regex.

Upvotes: 0

Mark Byers

Reputation: 838706

Here's the regular expression you need:

@"""(?<constant>(\\.|[^""])*)""\.Localize\(\)"

A test program:

using System;
using System.IO;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        Regex CSharpShortRegex =
            new Regex(@"""(?<constant>(\\.|[^""])*)""\.Localize\(\)");

        foreach (string line in File.ReadAllLines("input.txt"))
            foreach (Match match in CSharpShortRegex.Matches(line))
                Console.WriteLine(match.Groups["constant"].Value);
    }
}

Output:

Having \"Two\" On The Same Line
Is Tricky For regex
Hot Patootie

Notice that I have used @"..." to avoid having to escape backslashes inside the regular expression. I think this makes it easier to read.

Upvotes: 1
