buddemat

Reputation: 5301

Regex to break a line at the last comma before a certain length / number of characters

Someone asked this as a follow-up to an otherwise unrelated question (C# Regex split by multiple closing brackets sets), so I'm adding it as a separate question:

I have a number of lines. If a line is longer than 50 characters, I would like to split it at the last comma (,) that occurs before the 50-character mark.

Example input:

AND ( AUART IN ( 'PM01', 'PM02', 'PM03' ) )
AND ( AUART IN ( 'PM01', 'PM02', 'PM03', 'PM10', 'PM99', 'PM59' ) )
AND ( AUART IN ( 'PM01', 'PM01232132132', 'PM03' ) )

Expected output:

AND ( AUART IN ( 'PM01', 'PM02', 'PM03' ) )
AND ( AUART IN ( 'PM01', 'PM02', 'PM03', 'PM10',
 'PM99', 'PM59' ) )
AND ( AUART IN ( 'PM01', 'PM01232132132',
 'PM03' ) )

Upvotes: 0

Views: 210

Answers (2)

The fourth bird

Reputation: 163372

You might also use a pattern without capture groups and a lookbehind asserting either the start of the string or a comma.

Then assert 50 chars to the right and match 1-49 characters followed by a comma.

In the replacement, use the full match followed by a newline: $0\n

(?<=^|,)(?=.{50}).{1,49},


using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

List<string> strings = new List<string>()
{
    "AND ( AUART IN ( 'PM01', 'PM02', 'PM03' ) )",
    "AND ( AUART IN ( 'PM01', 'PM02', 'PM03', 'PM10', 'PM99', 'PM59' ) )",
    "AND ( AUART IN ( 'PM01', 'PM01232132132', 'PM03' ) )",
    "AND ( AUART IN ( 'PM0654654654654654654654654654651', 'PM02' ) )",
    "AND ( AUART IN ( 'PM01', 'PM02', 'PM03', 'PM04', 'PM11', 'PM12', 'PM13', 'PM14', 'PM15', 'PM16', 'PM21', 'PM22', 'PM23', 'PM24', 'PM25', 'PM31' ) )"
};
var regex = new Regex(@"(?<=^|,)(?=.{50}).{1,49},");

foreach (String s in strings)
{
    Console.WriteLine(regex.Replace(s, "$0\n"));
}

Output

AND ( AUART IN ( 'PM01', 'PM02', 'PM03' ) )
AND ( AUART IN ( 'PM01', 'PM02', 'PM03', 'PM10',
 'PM99', 'PM59' ) )
AND ( AUART IN ( 'PM01', 'PM01232132132',
 'PM03' ) )
AND ( AUART IN ( 'PM0654654654654654654654654654651', 'PM02' ) )
AND ( AUART IN ( 'PM01', 'PM02', 'PM03', 'PM04',
 'PM11', 'PM12', 'PM13', 'PM14', 'PM15', 'PM16',
 'PM21', 'PM22', 'PM23', 'PM24', 'PM25',
 'PM31' ) )
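
If the limit should be configurable, the same pattern can be assembled from a parameter. The following is a minimal sketch under that assumption; the class name CommaWrapper and the helper WrapAtComma are hypothetical and not part of the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommaWrapper
{
    // Hypothetical helper: builds (?<=^|,)(?=.{N}).{1,N-1}, for a given N
    // and inserts a newline after every matched comma.
    public static string WrapAtComma(string line, int maxLength)
    {
        if (maxLength < 2)
        {
            throw new ArgumentOutOfRangeException(nameof(maxLength));
        }

        // Plain concatenation keeps the regex braces readable.
        string pattern = @"(?<=^|,)(?=.{" + maxLength + @"}).{1," + (maxLength - 1) + @"},";
        return Regex.Replace(line, pattern, "$0\n");
    }

    public static void Main()
    {
        string input = "AND ( AUART IN ( 'PM01', 'PM02', 'PM03', 'PM10', 'PM99', 'PM59' ) )";
        Console.WriteLine(WrapAtComma(input, 50));
    }
}
```

With a limit of 50, this should behave like the fixed pattern shown above. Other limits can be passed in the same way.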

Upvotes: 2

buddemat

Reputation: 5301

You can start your regex with a positive lookahead that asserts the line is at least 50 characters long, then add a negative lookbehind that makes sure there are fewer than 50 characters before the comma you want to match:

(?=.{50})(.*)(?<!.{50})(,)

The match then ends at the comma you want to split at or, for example, replace with a comma and a newline.

Full example:

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public class Example
{
   public static void Main()
   {
      string pattern = @"(?=.{50})(.*)(?<!.{50})(,)";
      string replacement = "$1,\n";
      List<string> inputs = new List<string>();
      inputs.Add("AND ( AUART IN ( 'PM01', 'PM02', 'PM03' ) )"); // shorter than 50 chars
      inputs.Add("AND ( AUART IN ( 'PM01', 'PM02', 'PM03', 'PM10', 'PM99', 'PM59' ) )");
      inputs.Add("AND ( AUART IN ( 'PM01', 'PM01232132132', 'PM03' ) )");
      inputs.Add("AND ( AUART IN ( 'PM0654654654654654654654654654651', 'PM02' ) )"); // first comma appearing later than character 50
     
      foreach (string input in inputs)
      {
          string result = Regex.Replace(input, pattern, replacement);
          Console.WriteLine(result);
      }
   }
}

Note that this has some limitations:

  1. If there is a comma inside a quoted value ('...'), the regex will match it as well, which may or may not be what you want.
  2. If the first comma appears only after the first 50 characters, the regex will not match and the line is left unsplit.
  3. If the line is longer than about 100 characters, the second part will itself be longer than 50 characters, which is presumably not what you want.
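
To make the first limitation concrete, here is a small sketch with a made-up input in which the last comma inside the first 50 characters sits within a quoted value. The input string and the names QuotedCommaExample and SplitOnce are assumptions for illustration only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuotedCommaExample
{
    // Applies the single-split pattern from above once per line.
    public static string SplitOnce(string line)
    {
        return Regex.Replace(line, @"(?=.{50})(.*)(?<!.{50})(,)", "$1,\n");
    }

    public static void Main()
    {
        // Hypothetical input: the comma in 'Li, Max' is the last one
        // that has fewer than 50 characters in front of it.
        string input = "AND ( NAME IN ( 'Smith, John', 'Doe, Jane', 'Li, Max', 'X' ) )";
        Console.WriteLine(SplitOnce(input));
    }
}
```

The line break lands after 'Li, and therefore inside the quoted value, because the pattern has no notion of quoting.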

The third limitation can be addressed by capturing the remainder in a second capture group and applying the regex to that remainder again in a loop, for as long as it still matches:

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public class Example
{
   public static void Main()
   {
      List<string> inputs = new List<string>();
      inputs.Add("AND ( AUART IN ( 'PM01', 'PM02', 'PM03' ) )"); // shorter than 50 chars
      inputs.Add("AND ( AUART IN ( 'PM01', 'PM02', 'PM03', 'PM10', 'PM99', 'PM59' ) )");
      inputs.Add("AND ( AUART IN ( 'PM01', 'PM01232132132', 'PM03' ) )");
      inputs.Add("AND ( AUART IN ( 'PM0654654654654654654654654654651', 'PM02' ) )"); // first comma appearing later than character 50
      inputs.Add("AND ( AUART IN ( 'PM01', 'PM02', 'PM03', 'PM04', 'PM11', 'PM12', 'PM13', 'PM14', 'PM15', 'PM16', 'PM21', 'PM22', 'PM23', 'PM24', 'PM25', 'PM31' ) )"); // string longer than 100 chars, i.e. the remainder needs to be processed again
      List<string> results = new List<string>();
       
      var regex = new Regex(@"((?=.{50}).*(?<!.{50}),)(.*)");
    
      foreach (string input in inputs)
      {
          string str = input;
          while(!String.IsNullOrEmpty(str)) {
             var match = regex.Match(str);
             if (match.Success) {
                 results.Add(match.Groups[1].Value);         
                 str = match.Groups[2].Value;
             } else {
                 results.Add(str);
                 break;
             }
          }
      }
      Console.WriteLine(String.Join("\n", results));
   }
}

Result:

AND ( AUART IN ( 'PM01', 'PM02', 'PM03' ) )
AND ( AUART IN ( 'PM01', 'PM02', 'PM03', 'PM10',
 'PM99', 'PM59' ) )
AND ( AUART IN ( 'PM01', 'PM01232132132',
 'PM03' ) )
AND ( AUART IN ( 'PM0654654654654654654654654654651', 'PM02' ) )
AND ( AUART IN ( 'PM01', 'PM02', 'PM03', 'PM04',
 'PM11', 'PM12', 'PM13', 'PM14', 'PM15', 'PM16',
 'PM21', 'PM22', 'PM23', 'PM24', 'PM25',
 'PM31' ) )

To address the second limitation, you may consider breaking at whitespace \s as well as at commas (if that is possible in your application scenario). The regex for that would be ((?=.{50}).*(?<!.{50})[\s,])(.*).
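
As a rough sketch of how that variant behaves, the loop from the previous example can be reused with the [\s,] pattern on the input whose first comma comes too late. The class name WhitespaceOrCommaExample and the method SplitLine are hypothetical names chosen for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WhitespaceOrCommaExample
{
    // Same loop as in the previous example, but the break character
    // may be any whitespace character or a comma.
    public static List<string> SplitLine(string line)
    {
        var regex = new Regex(@"((?=.{50}).*(?<!.{50})[\s,])(.*)");
        var parts = new List<string>();
        string rest = line;

        while (!string.IsNullOrEmpty(rest))
        {
            Match match = regex.Match(rest);
            if (!match.Success)
            {
                // Shorter than 50 characters, or no usable break point.
                parts.Add(rest);
                break;
            }
            parts.Add(match.Groups[1].Value);
            rest = match.Groups[2].Value;
        }
        return parts;
    }

    public static void Main()
    {
        string input = "AND ( AUART IN ( 'PM0654654654654654654654654654651', 'PM02' ) )";
        foreach (string part in SplitLine(input))
        {
            // Brackets make trailing whitespace visible.
            Console.WriteLine("[" + part + "]");
        }
    }
}
```

For this input the break falls at the space before the long quoted value, so the first part keeps a trailing space. Whether that is acceptable, or needs a Trim(), depends on how the lines are consumed afterwards.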

Upvotes: 1
