CorreiaD

Reputation: 51

PHP regex: remove line breaks unless the preceding line ends with a dot

I have this text:

possono godere di la spiaggia, situato a 7 km da il porto turistico di A , a 5 chilometri da l'aeroporto di B. ALBERGO: formato da monolocali, appartamenti con

and I need to turn it into something like this with preg_replace:

possono godere di la spiaggia, situato a 7 km da il porto turistico di A, a 5 chilometri da l'aeroporto di B. ALBERGO: formato da monolocali, appartamenti con

I tried regular expressions like '/[^\.]\n/', but they also match the space after 'B.'.

Upvotes: 1

Views: 67

Answers (2)

Wiktor Stribiżew

Reputation: 627082

Use

$str = 'possono 
 godere 
 di la spiaggia, situato a 7 km da il porto         turistico di A , a 5 chilometri da l\'aeroporto di 
 B.
ALBERGO: formato da monolocali, appartamenti con';
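// Replace each whitespace run with a single space, unless it is immediately
// followed by a line start plus an uppercase marker such as "ALBERGO:".
$res = preg_replace('~\s+(?!^[A-Z]+:)~um', ' ', $str);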
echo $res;

See the PHP demo

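The pattern \s+(?!^[A-Z]+:) matches:

  • \s+ - one or more whitespace characters, provided they are not immediately followed by...
  • (?!^[A-Z]+:) - a negative lookahead that rejects the match when the next thing is the start of a line (^; the m modifier makes ^ match at the beginning of each line instead of only at the start of the string), then one or more uppercase ASCII letters ([A-Z]+), then a :.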
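To see why the m modifier matters here, a minimal sketch comparing the same pattern with and without it. The short sample string is made up for illustration; only the pattern comes from the answer.

```php
<?php
// Shortened sample input (an assumption for illustration, not the asker's full text).
$str = "di B.\nALBERGO: formato";

// With m: ^ matches right after the newline, so the lookahead protects
// the line break in front of "ALBERGO:".
$with = preg_replace('~\s+(?!^[A-Z]+:)~um', ' ', $str);

// Without m: ^ only matches at the very start of the string, so the
// lookahead never blocks anything and every whitespace run is replaced.
$without = preg_replace('~\s+(?!^[A-Z]+:)~u', ' ', $str);

var_dump($with);    // string "di B.\nALBERGO: formato" (newline kept)
var_dump($without); // string "di B. ALBERGO: formato"
```

Without m the newline before the marker is flattened like any other whitespace, which is the behavior the lookahead is meant to prevent.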

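The u modifier makes PHP treat the pattern and the subject as UTF-8, which matters if the strings contain non-ASCII letters. In that case, also replace [A-Z] with \p{Lu} so that uppercase letters outside ASCII are matched.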
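A small sketch of the \p{Lu} variant, assuming a hypothetical marker "CITTÀ:" that contains a non-ASCII uppercase letter (the sample string is invented, and the file is assumed to be saved as UTF-8):

```php
<?php
// Hypothetical input: the marker "CITTÀ:" ends in a non-ASCII uppercase letter.
$str = "vicino a\nl'aeroporto di B.\nCITTÀ: centro storico";

// [A-Z]+ cannot match "À", so the lookahead does not recognize the marker
// and the line break in front of it is replaced too.
$ascii = preg_replace('~\s+(?!^[A-Z]+:)~um', ' ', $str);

// \p{Lu} matches any Unicode uppercase letter, so the marker is recognized
// and its line break is kept.
$unicode = preg_replace('~\s+(?!^\p{Lu}+:)~um', ' ', $str);

var_dump($ascii);   // "vicino a l'aeroporto di B. CITTÀ: centro storico"
var_dump($unicode); // "vicino a l'aeroporto di B.\nCITTÀ: centro storico"
```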

Upvotes: 1

friedemann_bach

Reputation: 1458

I think this process should be split into several separate tasks. My proposal:

  1. Tidy up all whitespace sequences (\s+) and normalize them to one standard space (in PHP, preg_replace already replaces every match, so no separate "global" flag is needed).

  2. Restructure the text by identifying semantic markers like "ALBERGO: " and placing a line feed \n before them. You could even search for ". ALBERGO: " and replace it with ".\nALBERGO: ".

  3. Standardize (or beautify) the text by identifying isolated commas " , " and replacing them with ", ".
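The three steps above could be sketched in PHP roughly as follows. The input mirrors the sample from the other answer; treating any run of uppercase ASCII letters followed by ": " as a marker is my assumption, so adjust that pattern to whatever markers your data really uses.

```php
<?php
$str = "possono \n godere \n di la spiaggia, situato a 7 km da il porto         turistico di A , a 5 chilometri da l'aeroporto di \n B.\nALBERGO: formato da monolocali, appartamenti con";

// Step 1: collapse every whitespace run (including line breaks) into one space.
$res = preg_replace('~\s+~u', ' ', $str);

// Step 2: put a line feed before each marker that follows a sentence end.
// Assumption: a marker is uppercase ASCII letters followed by ": ".
$res = preg_replace('~\. (?=[A-Z]+: )~', ".\n", $res);

// Step 3: tidy isolated commas, " , " becomes ", ".
$res = str_replace(' , ', ', ', $res);

echo $res;
```

Keeping the steps separate makes each one easy to test on its own, at the cost of three passes over the string.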

Upvotes: 0
