DWE

Reputation: 137

Replace a colon followed by a carriage return and line feed with just the colon

I'm working on a project where I have an HTML fragment that needs to be cleaned up. The HTML tags have been removed and, as a result of a table being removed, there are some strange line endings where they shouldn't be :-)

The offending characters, as they appear, are:

  1. a space at the beginning of a line
  2. a colon, carriage return and line feed at the end of a line, which needs to be replaced with just the colon

I am presently using regex as follows:

s = Regex.Replace(s, @"(:[\r\n])", ":", RegexOptions.Multiline | RegexOptions.IgnoreCase);

// gets rid of the leading space
s = Regex.Replace(s, @"(^[( )])", "", RegexOptions.Multiline | RegexOptions.IgnoreCase);

Example of what I am dealing with:

Tomas Adams

Solicitor
APLawyers
p:
1800 995 718
f:
07 3102 9135
a:
22 Fultam Street
 PO Box 132, Booboobawah QLD 4113

which should look like:

Tomas Adams
Solicitor
APLawyers
p:1800 995 718
f:07 3102 9135
a:22 Fultam Street
PO Box 132, Booboobawah QLD 4113

The code above is my attempt to clean the string, but the result is far from perfect. Can someone help me correct the error and achieve my goal?

[EDIT] The offending characters:

f:\r\n07 3102 9135\r\na:\r\n22 

The combination :\r\n should be replaced by a single colon.

MTIA

Darrin

Upvotes: 3

Views: 470

Answers (4)

Wiktor Stribiżew

Reputation: 626961

You may use

var result = Regex.Replace(s, @"(?m)^\s+|(?<=:)(?:\r?\n)+|(\r?\n){2,}", "$1");

See the .NET regex demo.

Details

  • (?m) - equal to RegexOptions.Multiline - makes ^ match the start of any line here
  • ^ - start of a line
  • \s+ - 1+ whitespaces
  • | - or
  • (?<=:)(?:\r?\n)+ - a position that is immediately preceded with : (matched with (?<=:) positive lookbehind) followed with 1+ occurrences of an optional CR and LF (those are removed)
  • | - or
  • (\r?\n){2,} - two or more consecutive occurrences of an optional CR followed with an LF symbol. Only the last occurrence is saved in Group 1 memory buffer, thus the $1 replacement pattern inserts that last, single, occurrence.
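
To see the three alternatives working together, here is a minimal sketch that applies the pattern above to the sample from the question. The class and method names (`ContactCleaner`, `Clean`) are hypothetical, and the input is assumed to use CRLF line endings as described in the question's edit.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ContactCleaner
{
    // Hypothetical helper; the pattern itself is the one from this answer.
    //   ^\s+             leading whitespace on a line       -> removed
    //   (?<=:)(?:\r?\n)+ line breaks directly after a colon -> removed
    //   (\r?\n){2,}      runs of line breaks                -> one break ($1)
    public static string Clean(string s) =>
        Regex.Replace(s, @"(?m)^\s+|(?<=:)(?:\r?\n)+|(\r?\n){2,}", "$1");

    public static void Main()
    {
        // Sample text from the question, assumed to contain CRLF line endings.
        var input = "Tomas Adams\r\n\r\nSolicitor\r\nAPLawyers\r\np:\r\n1800 995 718\r\n"
                  + "f:\r\n07 3102 9135\r\na:\r\n22 Fultam Street\r\n"
                  + " PO Box 132, Booboobawah QLD 4113";

        Console.WriteLine(Clean(input));
    }
}
```

When the first two alternatives match, Group 1 does not participate, so `$1` expands to an empty string and the matched text is simply dropped.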

Upvotes: 1

aloisdg

Reputation: 23521

A Linq solution without Regex:

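// Requires: using System; using System.Linq; using System.Text;
var tmp = string.Empty;
// Split on both CRLF and LF so no stray '\r' is left at the end of a line.
var output = input.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries).Aggregate(new StringBuilder(), (a, b) => {
    if (b.EndsWith(":")) {  // feel free to also check for the size
        tmp = b; // hold the label until the next line arrives
    }
    else {
        a.AppendLine((tmp + b).Trim()); // remove space before or after a line
        tmp = string.Empty;
    }
    return a;
}).ToString();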

Try it Online!

Upvotes: 0

aloisdg

Reputation: 23521

A basic solution without Regex:

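// Split on both CRLF and LF so no stray '\r' is left at the end of a line.
var lines = input.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);
var output = new StringBuilder();
for (var i = 0; i < lines.Length; i++)
{
    // feel free to also check for the size; the bounds check guards a trailing "x:" line
    if (lines[i].EndsWith(":") && i + 1 < lines.Length)
    {
        lines[i + 1] = lines[i] + lines[i + 1]; // merge the label into the next line
        continue;
    }
    output.AppendLine(lines[i].Trim()); // remove space before or after a line
}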

Try it Online!

Upvotes: 1

ashish

Reputation: 2358

I tried your regular expression and was able to remove "\n" and ":" with the pattern @"([:\r\n])". It removes the ":" and "\n" at the end of each line; because it is a character class, it also matches those characters anywhere else in the text.
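
To illustrate the difference between a character class and a sequence, here is a small sketch run against the string from the question's edit. The class and method names are hypothetical, and an empty replacement string is assumed for the character-class pattern, since the answer does not state one.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ColonNewlineDemo
{
    // Character class: matches ANY single ':', '\r' or '\n', wherever it occurs.
    // Assumption: the matches are replaced with an empty string.
    public static string WithCharacterClass(string s) =>
        Regex.Replace(s, @"[:\r\n]", "");

    // Sequence: matches ':' only when a line break follows, and keeps the ':'.
    public static string WithSequence(string s) =>
        Regex.Replace(s, @":\r?\n", ":");

    public static void Main()
    {
        var sample = "f:\r\n07 3102 9135\r\na:\r\n22 ";

        // Colons and every line break are gone.
        Console.WriteLine(WithCharacterClass(sample));

        // Colons are kept; only the break right after each colon is removed.
        Console.WriteLine(WithSequence(sample));
    }
}
```

The sequence form keeps the line break between "07 3102 9135" and "a:", which is what the question asks for.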

Upvotes: 0
