Ryan Thomas

Reputation: 2002

Get values from a string based on a format

I am trying to extract individual values from a string based on a format. The format can change, so ideally I want to specify it using another string.

For example, say my input is "1. Line One - Part Two (Optional Third Part)". I would want to specify the format to match as "%number%. %first% - %second% (%third%)" and then get those values as variables.

The only way I could think of doing this was with RegEx groups, and I have very nearly got the RegEx working.

var input = "1. Line One - Part Two (Optional Third Part)";

var formatString = "%number%. %first% - %second% (%third%)";
    
var expression = new Regex("(?<Number>[^.]+). (?<First>[^-]+) - (?<Second>[^\\(]+) ((?<Third>[^)]+))");
    
var match = expression.Match(input);
    
Console.WriteLine(match.Groups["Number"].ToString().Trim());
Console.WriteLine(match.Groups["First"].ToString().Trim());
Console.WriteLine(match.Groups["Second"].ToString().Trim());
Console.WriteLine(match.Groups["Third"].ToString().Trim());

This produces the following output, which is correct apart from the stray opening bracket in the last value:

1
Line One
Part Two
(Optional Third Part

I'm now a bit lost as to how I could translate my format string into a regular expression. There are no fixed rules for the format, but it needs to be fairly easy for a user to write.

Any advice is greatly appreciated, or perhaps there is another way not involving Regex?

Upvotes: 2

Views: 814

Answers (3)

NetMage

Reputation: 26926

Your format contains special characters that are becoming part of the regular expression. You can use the Regex.Escape method to handle that. After that, you can just use a Regex.Replace with a delegate to transform the format into a regular expression:

var input = "1. Line One - Part Two (Optional Third Part)";
var fmt = "%number%. %first% - %second% (%third%)";

var templateRE = new Regex(@"%([a-z]+)%", RegexOptions.Compiled);
var pattern = templateRE.Replace(Regex.Escape(fmt), m => $"(?<{m.Groups[1].Value}>.+?)");

var ansRE = new Regex(pattern);
var ans = ansRE.Match(input);

Note: You may want to place ^ and $ at the beginning and end of the pattern respectively, to ensure the format must match the entire input string.
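To illustrate the note about anchoring, here is a minimal sketch that wraps the generated pattern in ^ and $ and reads the named groups back. The helper name ToAnchoredPattern is hypothetical (it is not part of the answer above), and the sketch assumes placeholder names consist only of lowercase letters.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: converts a %name% template into an anchored regex pattern.
static string ToAnchoredPattern(string fmt)
{
    // Escape the literal text first; '%' is not escaped by Regex.Escape,
    // so the placeholders are still recognisable afterwards.
    var escaped = Regex.Escape(fmt);
    // Replace each %name% placeholder with a lazy named group.
    var body = Regex.Replace(escaped, "%([a-z]+)%", m => $"(?<{m.Groups[1].Value}>.+?)");
    // Anchor the pattern so the format must cover the whole input.
    return "^" + body + "$";
}

var input = "1. Line One - Part Two (Optional Third Part)";
var pattern = ToAnchoredPattern("%number%. %first% - %second% (%third%)");
var ans = Regex.Match(input, pattern);

Console.WriteLine(ans.Groups["number"].Value); // 1
Console.WriteLine(ans.Groups["first"].Value);  // Line One
Console.WriteLine(ans.Groups["second"].Value); // Part Two
Console.WriteLine(ans.Groups["third"].Value);  // Optional Third Part

// With the anchors in place, trailing text makes the match fail.
var matchesWithTrailingText = Regex.IsMatch(input + " trailing", pattern);
Console.WriteLine(matchesWithTrailingText);    // False
```

Without the anchors, the same input with trailing text would still match, because the pattern would only need to match a prefix of the string.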

Upvotes: 0

Michał Turczyn

Reputation: 37430

Your pattern includes a couple of special characters (such as . and the brackets) without escaping them, so the regex does not match them literally.

Here is a corrected version of your code:

using System;
using System.Text.RegularExpressions;

var input = "1. Line One - Part Two (Optional Third Part)";

var pattern = string.Format(
    "(?<Number>{0})\\. (?<First>{1}) - (?<Second>{2}) \\((?<Third>{3})\\)", 
    "[^\\.]+", 
    "[^\\-]+", 
    "[^\\(]+", 
    "[^\\)]+");

var match = Regex.Match(input, pattern);

Console.WriteLine(match.Groups["Number"]);
Console.WriteLine(match.Groups["First"]);
Console.WriteLine(match.Groups["Second"]);
Console.WriteLine(match.Groups["Third"]);


If you want to keep your syntax, you can leverage the Regex.Escape method. I have also written some code that parses all the parameters enclosed in % signs:

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

var input = "1. Line One - Part Two (Optional Third Part)";

var formatString = "%number%. %first% - %second% (%third%)";

formatString = Regex.Escape(formatString);

var parameters = new List<string>();
formatString = Regex.Replace(formatString, "%([^%]+)%", match =>
{
    var paramName = match.Groups[1].Value;
    var groupPattern = "(?<" + paramName + ">{" + parameters.Count + "})";
    parameters.Add(paramName);
    return groupPattern;
});

var pattern = string.Format(
    formatString, 
    "[^\\.]+", 
    "[^\\-]+", 
    "[^\\(]+", 
    "[^\\)]+");

var match = Regex.Match(input, pattern);

foreach (var paramName in parameters)
{
    Console.WriteLine(match.Groups[paramName]);
}

Further notes

You still need to adjust the part where you specify the pattern for each group: currently it is not generic, because the four sub-patterns are hard-coded regardless of how many parameters the format contains.

Taking all of that into account and cleaning up the code a little, you could use a solution like this:

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FormatBasedCustomRegex
{
    public static string GetPattern(this string formatString,
        string[] subpatterns,
        out string[] parameters)
    {
        formatString = Regex.Escape(formatString);

        formatString = formatString.ReplaceParams(out var @params);

        if(@params.Length != subpatterns.Length)
        {
            throw new InvalidOperationException($"Expected {@params.Length} subpatterns but got {subpatterns.Length}.");
        }

        parameters = @params;

        return string.Format(
            formatString,
            subpatterns);
    }

    private static string ReplaceParams(
        this string formatString, 
        out string[] parameters)
    {
        var @params = new List<string>();
        var outputPattern = Regex.Replace(formatString, "%([^%]+)%", match =>
        {
            var paramName = match.Groups[1].Value;
            var groupPattern = "(?<" + paramName + ">{" + @params.Count + "})";
            @params.Add(paramName);
            return groupPattern;
        });

        parameters = @params.ToArray();

        return outputPattern;
    }
}

The calling code would then look like:


var input = "1. Line One - Part Two (Optional Third Part)";

var pattern = "%number%. %first% - %second% (%third%)".GetPattern(
    new[] 
    {
        "[^\\.]+",
        "[^\\-]+",
        "[^\\(]+",
        "[^\\)]+",
    },
    out var parameters);

var match = Regex.Match(input, pattern);

foreach (var paramName in parameters)
{
    Console.WriteLine(match.Groups[paramName]);
}

It's up to you how you define the particular methods and which signatures they should have to get the best code for your case :)
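As one possible way to avoid hand-writing a sub-pattern per parameter, the sketch below derives each group's sub-pattern from the first non-whitespace literal character that follows its placeholder in the format. This is only an illustration under stated assumptions: the helper name BuildPatternFromFormat is hypothetical, placeholder names are assumed to be letters only, and a captured value is assumed never to contain the character that delimits it.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper: builds an anchored pattern from a %name% format and
// collects the parameter names in the order they appear.
static string BuildPatternFromFormat(string format, List<string> names)
{
    var sb = new StringBuilder("^");
    var last = 0;
    foreach (Match m in Regex.Matches(format, "%([A-Za-z]+)%"))
    {
        // Literal text between placeholders is escaped and copied as-is.
        sb.Append(Regex.Escape(format.Substring(last, m.Index - last)));
        last = m.Index + m.Length;

        // Find the first non-whitespace literal character after the placeholder.
        var stop = last;
        while (stop < format.Length && char.IsWhiteSpace(format[stop])) stop++;

        // Capture everything up to that character; a \uXXXX escape keeps the
        // character class valid whatever the character is.
        var sub = stop < format.Length
            ? "[^\\u" + ((int)format[stop]).ToString("x4") + "]+"
            : ".+";

        var name = m.Groups[1].Value;
        names.Add(name);
        sb.Append("(?<" + name + ">" + sub + ")");
    }
    // Append whatever literal text follows the last placeholder.
    sb.Append(Regex.Escape(format.Substring(last)));
    return sb.Append('$').ToString();
}

var names = new List<string>();
var generated = BuildPatternFromFormat("%number%. %first% - %second% (%third%)", names);
var result = Regex.Match("1. Line One - Part Two (Optional Third Part)", generated);

foreach (var name in names)
{
    Console.WriteLine(name + ": " + result.Groups[name].Value);
}
```

For the format above this yields the same negated character classes that were passed by hand (not ".", not "-", not "(", not ")"), so the caller only has to supply the format string.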

Upvotes: 2

anubhava

Reputation: 785611

You may use this regex:

^(?<Number>[^.]+)\. (?<First>[^-]+) - (?<Second>[^(]+)(?: \((?<Third>[^)]+)\))?$


RegEx Details:

  • ^: Start
  • (?<Number>[^.]+): Match and capture 1+ of any char that is not .
  • \. : Match ". "
  • (?<First>[^-]+): Match and capture 1+ of any char that is not -
  •  - : Match " - "
  • (?<Second>[^(]+): Match and capture 1+ of any char that is not (
  • (?:: Start a non-capture group
    •  \(: Match space followed by (
    • (?<Third>[^)]+): Match and capture 1+ of any char that is not )
    • \): Match )
  • )?: End optional non-capture group
  • $: End
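A short sketch of how this regex might be used from C#, assuming the same sample input as the question plus a second, made-up input without the bracketed part. Since the third part is optional, the group's Success property can be checked before reading its value.

```csharp
using System;
using System.Text.RegularExpressions;

var regex = new Regex(
    @"^(?<Number>[^.]+)\. (?<First>[^-]+) - (?<Second>[^(]+)(?: \((?<Third>[^)]+)\))?$");

string[] inputs =
{
    "1. Line One - Part Two (Optional Third Part)",
    "2. Line One - Part Two", // hypothetical input without the optional part
};

foreach (var input in inputs)
{
    var m = regex.Match(input);
    var thirdGroup = m.Groups["Third"];
    // The optional group does not participate in the match when the brackets are absent.
    var third = thirdGroup.Success ? thirdGroup.Value : "(none)";
    Console.WriteLine(m.Groups["Number"].Value + " | " + m.Groups["First"].Value
        + " | " + m.Groups["Second"].Value + " | " + third);
}
```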

Upvotes: 1
