RD Ward

Reputation: 6737

RegEx to delete all double whitespace EXCEPT \n? preg_replace

I have imported a plain-text version of a PDF using a Python script, but it has a bunch of garbage artifacts that I just don't care about.

The only whitespace I care about is (1) single spaces, and (2) double \n's.

Single spaces, for obvious reasons, between words. Double \n's, to separate paragraphs.

The garbage whitespace it contains looks like this:

[\ \n\t]+ all jumbled together

Which leads me to another problem: sometimes the paragraphs are demarcated by

[\n][\s]+[\n]

I am not experienced enough with regex to make it ignore the inner whitespace between the two \n's. As an amateur RegExer, my problem is that \s includes \n.

If it didn't -- I think this would be a really easy problem to solve.

All other whitespace is irrelevant, and nothing I have tried so far has worked.

Any suggestions would be greatly appreciated.

Sample text

Summary: The Department of Environment in Bangladesh seized 265 sacks of poultry feed 
tainted with tannery waste and various chemicals. 

Synthesis/Analysis: The Department of Environment seized the tainted poultry feed on 
28 March from a house in the city of Adabar located in Dhaka province. Workers were 
found in the house, which was used as an illegal factory, producing the tainted feed. The 
Bangladesh Environment Conservation Act allowed for a case to be filed against the 
factory’s manager, Mahmud Hossain, and the owner, who was not named. 

It was reported that the Department of Environment had also closed three other factories 
in Hazaribag a month prior to this instance for the same charges. The Bangladesh Council of 
Scientific and Industrial Research found that samples from the feed taken from these 
factories had “dangerous levels of chromium…”  The news report also stated that “poultry 
6 



and eggs became poisonous” from consuming the tainted feed, which would also cause 
health concerns for consumers. 

This is just leading me to more fixes... Gotta remove all the page numbers, and random double \n's.

Upvotes: 0

Views: 907

Answers (4)

bluepnume

Reputation: 17118

Ever so slightly hacky, but I think this should do the trick:

$data = preg_replace('/\t|( ){2,}|(\n)\s+(\n)/', '\1\2\3', $data);

Bonus: doesn't require doing two passes on the string.
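For anyone who wants to try the pattern from the Python side of the pipeline, here is a minimal sketch of the same single-pass idea using `re.sub`. The function name `squeeze_single_pass` is made up for illustration, and the sketch assumes Python 3.5 or later, where groups that did not take part in a match expand to an empty string in the replacement.

```python
import re

# Same alternation as the preg_replace call above:
#   \t              -> tab, nothing captured, so it is deleted
#   ( ){2,}         -> run of spaces, group 1 keeps one space
#   (\n)\s+(\n)     -> two newlines with whitespace between, groups 2 and 3 keep the newlines
SINGLE_PASS = re.compile(r'\t|( ){2,}|(\n)\s+(\n)')


def squeeze_single_pass(data: str) -> str:
    # Hypothetical helper name. Groups that did not participate
    # expand to '' (Python 3.5+), so only the matched branch contributes.
    return SINGLE_PASS.sub(r'\1\2\3', data)


if __name__ == "__main__":
    print(repr(squeeze_single_pass("first   line\n \t\nsecond\tline")))
```

Note that a tab is removed outright rather than turned into a space, so two words separated only by a tab end up joined.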

Upvotes: 0

Seth Robertson

Reputation: 31441

s/(\h)+/$1/g;
s/\n(\s*\n)+/\n\n/g;

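The above will collapse each run of horizontal whitespace to a single character (I assume you are not worried about vertical tab or carriage return) and reduce any run of two or more newlines, with or without whitespace between them, to exactly two.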
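A rough Python equivalent of the two substitutions, as a sketch only. Python's `re` module has no `\h`, so `[ \t]` stands in for it here on the assumption that spaces and tabs are the only horizontal whitespace in the input. Unlike the `$1` replacement above, this version always writes a plain space. The name `collapse_whitespace` is hypothetical.

```python
import re


def collapse_whitespace(text: str) -> str:
    # Step 1: runs of spaces/tabs become one space
    # ([ \t] is an assumed stand-in for Perl's \h).
    text = re.sub(r'[ \t]+', ' ', text)
    # Step 2: a newline followed by one or more further newlines
    # (optionally separated by whitespace) becomes exactly two newlines.
    text = re.sub(r'\n(?:\s*\n)+', '\n\n', text)
    return text


if __name__ == "__main__":
    print(repr(collapse_whitespace("para one\n \n\n  \npara \t two")))
```

Single newlines are left alone, so hard-wrapped lines inside a paragraph stay wrapped.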

Upvotes: 0

MRAB

Reputation: 20654

I think this will work:

  1. Replace "\s*\n\s*\n\s*" with "\n\n".

    This will standardise the paragraph separators.

  2. Replace "\s* \s*" with " ".

    This will standardise the word separators.
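The two steps above could be sketched in Python like this. The function name `standardise` is invented for the example; the patterns are the ones given in the steps.

```python
import re


def standardise(text: str) -> str:
    # Step 1: any whitespace run containing at least two newlines
    # becomes a clean paragraph separator.
    text = re.sub(r'\s*\n\s*\n\s*', '\n\n', text)
    # Step 2: any whitespace run containing at least one space
    # becomes a single space.
    text = re.sub(r'\s* \s*', ' ', text)
    return text


if __name__ == "__main__":
    print(repr(standardise("one  two \nthree\n \t \n\nfour")))
```

One thing to be aware of: step 2 also swallows a single line break when a space sits next to it, which rejoins hard-wrapped lines that end in a trailing space (as in the sample text). A line break with no adjacent space is left as it is.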

Upvotes: 1

mario

Reputation: 145482

You can use an assertion to make \s exclude line breaks:

 ((?!\n)\s){2,}

To merge line breaks that have whitespace between them (\n\s+\n), you can use a similar construct in place of the \s+. But for simplicity I would just use two preg_replace calls: first merge the line breaks, then clean up double spaces.
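A minimal sketch of that two-pass approach, written in Python since the lookahead syntax is the same there. The names `NON_NEWLINE_WS` and `clean` are hypothetical, and replacing each run with a single space is an assumption about the desired output.

```python
import re

# "Whitespace, but not a newline": the negative lookahead rejects \n
# before \s gets a chance to consume it.
NON_NEWLINE_WS = r'(?:(?!\n)\s)'


def clean(text: str) -> str:
    # Pass 1: two newlines separated by non-newline whitespace
    # become a plain double newline.
    text = re.sub(r'\n' + NON_NEWLINE_WS + r'+\n', '\n\n', text)
    # Pass 2: runs of two or more non-newline whitespace characters
    # become a single space.
    text = re.sub(NON_NEWLINE_WS + r'{2,}', ' ', text)
    return text


if __name__ == "__main__":
    print(repr(clean("first  para\n \t \nsecond \t para")))
```

Because neither pattern can consume a newline on its own, single line breaks and existing double line breaks pass through untouched.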

Upvotes: 3
