Dylan Kinnett

Reputation: 240

Cleaning up Linebreaks with Regex

Often I'm copying text out of a PDF or similar and the line breaks aren't the way I want them. Instead of many short lines within each paragraph, I want each paragraph to be a single line of text, with a blank line between paragraphs.

Thanks to other answers on here I can fix this with regex in just a few steps:

  1. Find all the double line breaks, [\r\n][\r\n], and replace them with a placeholder string like -------placeholder--------. Don't worry, that placeholder will go back to being the space between paragraphs.
  2. Now that we know where the paragraph breaks belong, it is safe to get rid of all the remaining line breaks: replace [\r\n] with nothing.
  3. You should now have one single line of text for the entire document, with the placeholder string in place of the paragraph breaks.
  4. Replace -------placeholder-------- with a double line break (\n\n, or \r\n\r\n for Windows line endings).
  5. Done!
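
The steps above can be sketched in Python roughly as follows. This is only an illustration under some assumptions: the function name and the `joiner` parameter are made up for the example, the placeholder is assumed never to occur in the text, and line endings are normalized to \n first so that a single \r\n is not mistaken for a pair of line breaks.

```python
import re

# Assumed never to appear in the real text.
PLACEHOLDER = "-------placeholder--------"


def unwrap_with_placeholder(text: str, joiner: str = "") -> str:
    """Join the lines inside each paragraph, keeping a blank line between paragraphs."""
    # Assumption: normalize \r\n and bare \r to \n before doing anything else.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Step 1: protect paragraph breaks (pairs of line breaks) with the placeholder.
    text = re.sub(r"\n\n", PLACEHOLDER, text)
    # Step 2: remove every remaining (single) line break.
    text = text.replace("\n", joiner)
    # Step 4: turn the placeholder back into a paragraph break.
    return text.replace(PLACEHOLDER, "\n\n")


print(unwrap_with_placeholder("test\ntest2\n\ntest3\ntest4"))
```

Passing `joiner=" "` instead of the default keeps the last word of one line from being glued to the first word of the next, which is usually what text copied from a PDF needs.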

But I'm curious: is there a way to do this with fewer steps? For example, is it possible in regex to say "find all line breaks, except pairs of line breaks, and replace with nothing"? This would eliminate the need for the placeholder step.

Upvotes: 0

Views: 178

Answers (2)

user557597

Yes, it's possible to do this with a single regex.
The approach is to find two non-whitespace characters separated by a single line break (optionally with spaces or tabs around it).

Example:
This is first sentence in paragraph.\nThis is the second.

This is the second paragraph.


Make sense?

This is available in two versions: one that trims the non-linebreak whitespace around the line break, and one that leaves it in place.

 # Trimming:
 # Find:  (?<=\S)[^\S\r\n]*\r\n[^\S\r\n]*(?=\S)
 # Replace: ' '

 (?<= \S )
 [^\S\r\n]* \r \n [^\S\r\n]* 
 (?= \S )

and

 # Non-Trimming
 # Find:   (\S[^\S\r\n]*)\r\n([^\S\r\n]*\S)
 # Replace: '$1 $2'

 ( \S [^\S\r\n]* )             # (1)
 \r \n 
 ( [^\S\r\n]* \S )             # (2)
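
For anyone who wants to try both patterns outside an editor, here is a minimal Python sketch. The function names and the sample string are assumptions made up for the illustration. Note that Python's `re` module writes the back-references in the replacement as `\1 \2` rather than `$1 $2`.

```python
import re

# Trimming version: the lookarounds check the neighbouring characters without
# consuming them, so only the line break and the spaces/tabs around it are replaced.
TRIM = re.compile(r"(?<=\S)[^\S\r\n]*\r\n[^\S\r\n]*(?=\S)")

# Non-trimming version: the text on both sides is captured and written back.
KEEP = re.compile(r"(\S[^\S\r\n]*)\r\n([^\S\r\n]*\S)")


def join_trimming(text: str) -> str:
    """Replace each single \\r\\n (plus surrounding spaces/tabs) with one space."""
    return TRIM.sub(" ", text)


def join_keeping_whitespace(text: str) -> str:
    """Replace each single \\r\\n with a space, keeping the surrounding spaces/tabs."""
    return KEEP.sub(r"\1 \2", text)


sample = "first line  \r\n  second line\r\n\r\nnext paragraph"
print(repr(join_trimming(sample)))
print(repr(join_keeping_whitespace(sample)))
```

In both cases the \r\n\r\n between paragraphs is left alone, because neither side of those line breaks is directly a non-whitespace character on the same pass. One difference to keep in mind: the non-trimming pattern consumes the characters it captures, so a line consisting of a single non-whitespace character can cause the following line break to be skipped.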

Upvotes: 1

Gerino

Reputation: 1983

OK, I can tell you how it would work for just \n:

In C#:

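using System;
using System.Text.RegularExpressions;

var input = "test\ntest2\n\ntest3\ntest4";
// Match a \n that is followed by a non-\n character and preceded by one, i.e. a lone line break.
var regex = @"\n(?:(?=[^\n])(?<=[^\n]\n))";
var s2 = Regex.Replace(input,regex, "");
Console.WriteLine(s2);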

Result:

testtest2

test3test4

And I think I got it for \r\n - but test it thoroughly ;)

var input = "test\r\ntest2\r\n\r\ntest3\r\ntest4";
var regex = @"(?<!\r\n)\r\n(?!\r\n)";

var s2 = Regex.Replace(input,regex, "");
Console.WriteLine(s2);

Result:

testtest2

test3test4
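
The same lookaround idea can be sketched in Python with one pattern that accepts either \n or \r\n. This is a hypothetical combination of the two patterns above, not something tested beyond the small cases shown, and the function name and `joiner` parameter are made up for the example.

```python
import re

# An optional \r followed by \n, with no line-break character directly
# before or after it, i.e. a line break that is not part of a paragraph break.
SINGLE_BREAK = re.compile(r"(?<![\r\n])\r?\n(?![\r\n])")


def join_wrapped_lines(text: str, joiner: str = "") -> str:
    """Replace lone line breaks with `joiner`, leaving runs of line breaks untouched."""
    return SINGLE_BREAK.sub(joiner, text)


print(repr(join_wrapped_lines("test\ntest2\n\ntest3\ntest4")))
print(repr(join_wrapped_lines("test\r\ntest2\r\n\r\ntest3\r\ntest4")))
```

Using `joiner=" "` puts a space where each removed line break was, which avoids results like "testtest2" when the lines are ordinary prose.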

Upvotes: 0
