bevacqua

Reputation: 48476

Building a regex, how to remove redundant line breaks?

I have a string like this

"a           a            a  a aaa b c d e f a g a aaa  aa           a       a"

I want to turn it into either

"a b c d e f a g a"

or

"a                        b c d e f a g a                   "

(whichever's easier, it doesn't matter since it'll be HTML)

"a"s are line breaks ( \r\n ), in case that changes anything.

Upvotes: 0

Views: 1788

Answers (5)

bevacqua

Reputation: 48476

Went with this:

private string GetDescriptionFor(HtmlDocument document)
{
    string description = CrawlUsingMetadata(XPath.ResourceDescription, document);

    // A line break (\r\n or \n) plus any trailing spaces, repeated three or more times.
    // No RegexOptions are needed: the pattern has no anchors and no letters.
    Regex regex = new Regex(@"(?:\r?\n[ ]*){3,}");

    // Collapse each such run into a single blank line, then decode HTML entities.
    string result = regex.Replace(description, "\n\n");
    return HttpUtility.HtmlDecode(result);
}

As intended, it leaves single and double line breaks alone. It only matches runs of three or more consecutive line breaks (ignoring any spaces between them) and replaces each run with \n\n.
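
For anyone who wants to try the idea outside .NET, here is a minimal JavaScript sketch of the same rule. The function name collapseBlankRuns and the sample string are made up for illustration; they are not part of the original code.

```javascript
// Hypothetical helper: a JavaScript port of the "three or more line breaks" rule.
function collapseBlankRuns(text) {
  // One unit is \r\n or \n followed by any number of spaces.
  // Three or more units in a row become a single blank line (\n\n).
  return text.replace(/(?:\r?\n[ ]*){3,}/g, "\n\n");
}

// Four line breaks with stray spaces between them, then a lone line break.
const sample = "first\r\n   \r\n  \r\n\r\nsecond\r\nthird";

console.log(JSON.stringify(collapseBlankRuns(sample)));
// "first\n\nsecond\r\nthird"
```

The lone \r\n between "second" and "third" is left as it is, because it is not part of a run of three or more.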

Upvotes: 1

Ωmega

Reputation: 43673

If you need C# code and want to collapse just the \r\n sequences, together with any leading and trailing whitespace, then the solution is pretty simple:

string result = Regex.Replace(input, @"\s*\r\n\s*", "\r\n");


Upvotes: 0

Ria

Reputation: 10347

Try this one:

Regex.Replace(inputString, @"(\r\n\s+)", " ");

Upvotes: -1

Ωmega

Reputation: 43673

Generally your code should be:

s.replace(new RegExp("(\\S)(?:\\s*\\1)+","g"), "$1"); 


But depending on what the characters a, b, c, ... represent in your question, you might need to change \\S to another class, such as [^ ], and \\s to [ ], if you want \r and \n to be collapsed as well:

s.replace(new RegExp("([^ ])(?:[ ]*\\1)+","g"), "$1");


However, if a represents the string \r\n, then you need a slightly more complicated pattern:

s.replace(new RegExp("(\\r\\n|\\S)(?:[^\\S\\r\\n]*\\1)+","g"), "$1");
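
As a quick check, here is a sketch that runs the first and the last pattern on small inputs. The sample strings are shortened, made-up variants of the one in the question, and the variable names are only for illustration.

```javascript
// The question's string with the runs of spaces shortened.
const s = "a   a    a  a aaa b c d e f a g a aaa  aa   a   a";

// First pattern: a non-whitespace character repeated, separated by any whitespace.
const collapsed = s.replace(/(\S)(?:\s*\1)+/g, "$1");
console.log(collapsed); // a b c d e f a g a

// Last pattern, on a string where the repeated unit is a real \r\n.
const withBreaks = "x\r\n  \r\n \r\n\r\ny\r\nz";
const collapsedBreaks = withBreaks.replace(/(\r\n|\S)(?:[^\S\r\n]*\1)+/g, "$1");
console.log(JSON.stringify(collapsedBreaks)); // "x\r\ny\r\nz"

// Caveat: the \S alternative also collapses ordinary doubled letters.
console.log("hello".replace(/(\r\n|\S)(?:[^\S\r\n]*\1)+/g, "$1")); // helo
```

Because of that last caveat, on real text it is probably safer to drop the \S alternative and keep only \r\n in the capturing group.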


Upvotes: 1

Antal Spector-Zabusky

Reputation: 36622

If I understand the problem correctly, the goal is to remove duplicate copies of a specific character/string, possibly separated by spaces. You can do that by replacing the regular expression (a\s*)+ with "a ": the + matches multiple consecutive copies, and a\s* matches an a followed by spaces. How precisely you do that depends on the language: in Perl it's $str =~ s/(a\s*)+/a /g, in Ruby it's str.gsub(/(a\s*)+/, "a "), and so on.

The fact that a is actually \r\n shouldn't complicate things, but might mean that the replacement would work better as s/(\r\n[ \t]*)+/\r\n/g (since \s overlaps with \r and \n).
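
In JavaScript, which another answer here uses, that second substitution could be sketched roughly as follows. The function name collapseLineBreaks and the input string are invented for the example.

```javascript
// Hypothetical helper: collapse each run of \r\n (with trailing spaces/tabs) to one \r\n.
function collapseLineBreaks(text) {
  // [ \t]* instead of \s* so the class cannot swallow \r or \n itself.
  return text.replace(/(?:\r\n[ \t]*)+/g, "\r\n");
}

// Three line breaks separated by a space, a tab and some indentation.
const input = "one\r\n \t\r\n\r\n   two\r\nthree";

console.log(JSON.stringify(collapseLineBreaks(input)));
// "one\r\ntwo\r\nthree"
```

Note that the indentation in front of "two" is removed as well, since spaces and tabs after a line break are part of the match.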

Upvotes: 0
