Reputation: 6737
I have imported a plain-text version of a PDF using a Python script, but it has a bunch of garbage artifacts that I just don't care about.
The only whitespace I care about is (1) single spaces, and (2) double \n's.
Single space, for obvious reasons, between word boundaries. Double \n's, to demarcate between paragraphs.
The garbage whitespace it contains looks like this:
[\ \n\t]+ all jumbled together
Which leads me to another problem: sometimes the paragraphs are demarcated by
[\n][\s]+[\n]
I am not experienced enough with regex to make it ignore the inner whitespace between the two \n's. As an amateur RegExer, my problem is that \s includes \n.
If it didn't -- I think this would be a really easy problem to solve.
All other whitespace is irrelevant, and nothing I have tried so far works.
Any suggestions would be greatly appreciated.
Summary: The Department of Environment in Bangladesh seized 265 sacks of poultry feed
tainted with tannery waste and various chemicals.
Synthesis/Analysis: The Department of Environment seized the tainted poultry feed on
28 March from a house in the city of Adabar located in Dhaka province. Workers were
found in the house, which was used as an illegal factory, producing the tainted feed. The
Bangladesh Environment Conservation Act allowed for a case to be filed against the
factory’s manager, Mahmud Hossain, and the owner, who was not named.
It was reported that the Department of Environment had also closed three other factories
in Hazaribag a month prior to this instance for the same charges. The Bangladesh Council of
Scientific and Industrial Research found that samples from the feed taken from these
factories had “dangerous levels of chromium…” The news report also stated that “poultry
6
and eggs became poisonous” from consuming the tainted feed, which would also cause
health concerns for consumers.
This is just leading me to more fixes... Gotta remove all the page numbers, and random double \n's.
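For the stray page numbers, here is a minimal Python sketch. It assumes a page number always sits alone on its own line (like the 6 in the sample above) and that no real line of the text consists only of digits. The function name drop_page_number_lines is made up for illustration.

```python
import re

def drop_page_number_lines(text: str) -> str:
    # Assumption: a page number is a line holding only digits,
    # optionally padded with spaces or tabs.
    # The trailing newline is removed too, so the neighbouring lines rejoin.
    return re.sub(r"^[ \t]*\d+[ \t]*\n", "", text, flags=re.MULTILINE)
```

A line such as "28 March from a house" is left alone because it contains more than digits.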
Upvotes: 0
Views: 907
Reputation: 17118
Ever so slightly hacky, but I think this should do the trick:
preg_replace('/\t|( ){2,}|(\n)\s+(\n)/', '\1\2\3', $data);
Bonus: doesn't require doing two passes on the string.
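Since the asker's script is in Python, here is a rough port of the same single-pass pattern to re.sub, as a sketch rather than a drop-in replacement. The function name single_pass_clean is hypothetical. Groups that did not take part in a match come back as empty strings in the replacement (Python 3.5 and later).

```python
import re

def single_pass_clean(data: str) -> str:
    # Same alternation as the PHP pattern:
    #   \t            -> no group is set, so the tab is deleted
    #   ( ){2,}       -> group 1 holds one space
    #   (\n)\s+(\n)   -> groups 2 and 3 hold the two newlines
    return re.sub(r"\t|( ){2,}|(\n)\s+(\n)", r"\1\2\3", data)
```

Note that a tab is removed outright rather than turned into a space, so two words separated only by a tab end up joined.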
Upvotes: 0
Reputation: 31441
s/(\h)+/$1/g;
s/\n(\s*\n)+/\n\n/g;
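The above collapses each run of horizontal whitespace to a single character (I assume you are not worried about vertical tab or carriage return) and reduces any run of two or more newlines, including whitespace between them, to exactly two newlines.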
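A hedged Python equivalent of those two substitutions, for anyone staying in Python: the re module has no \h, so this sketch assumes horizontal whitespace means just space and tab, and it always substitutes a plain space (the Perl version keeps the last matched character). The name collapse_whitespace is made up.

```python
import re

def collapse_whitespace(text: str) -> str:
    # Stand-in for s/(\h)+/$1/g: [ \t] replaces \h (an assumption),
    # and every run becomes a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Same idea as s/\n(\s*\n)+/\n\n/g: a newline followed by one or more
    # "optional whitespace, then newline" groups becomes one paragraph break.
    text = re.sub(r"\n(?:\s*\n)+", "\n\n", text)
    return text
```

A lone newline is left untouched, so ordinary line wraps inside a paragraph survive.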
Upvotes: 0
Reputation: 20654
I think this will work:
Replace "\s*\n\s*\n\s*" with "\n\n".
This will standardise the paragraph separators.
Replace "\s* \s*" with " ".
This will standardise the word separators.
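The two replacements above could look like this in Python. This is only a sketch, and the function name standardise is hypothetical.

```python
import re

def standardise(text: str) -> str:
    # Step 1: any two newlines, with optional whitespace around and
    # between them, become a clean paragraph separator.
    text = re.sub(r"\s*\n\s*\n\s*", "\n\n", text)
    # Step 2: a space plus any whitespace touching it becomes one space.
    text = re.sub(r"\s* \s*", " ", text)
    return text
```

Because \s also matches \n, the second step swallows a single line break that sits next to a space, while a line break with no adjacent space is kept.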
Upvotes: 1
Reputation: 145482
You can use an assertion to make \s exclude line breaks:
((?!\n)\s){2,}
To merge line breaks that have whitespace in between (\n\s+\n), you can use a similar construct in place of the \s+. But for simplicity I would just use two preg_replace calls: first merge the line breaks, then clean up the double spaces.
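As a sketch of that two-pass idea in Python (the asker's language), using the same negative lookahead; the names HSPACE and two_pass_clean are made up for illustration:

```python
import re

# One whitespace character that is not a newline: the lookahead (?!\n)
# rejects a newline before \s gets to consume it.
HSPACE = r"(?:(?!\n)\s)"

def two_pass_clean(text: str) -> str:
    # Pass 1: two line breaks separated only by non-newline whitespace
    # are merged into a clean paragraph break.
    text = re.sub(r"\n" + HSPACE + r"+\n", "\n\n", text)
    # Pass 2: runs of two or more non-newline whitespace characters
    # collapse to a single space.
    text = re.sub(HSPACE + r"{2,}", " ", text)
    return text
```

With {2,} a single stray tab between words is left as it is; switching the quantifier to + would turn it into a space as well.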
Upvotes: 3