user2075215

Reputation: 379

Insert newlines before sequences of all-caps words

I have a number of documents whose text I need to break into chunks. Each document contains runs of uppercase words, and a new section should start wherever such a run begins. For example:

LORUM ipsum dolor sit amet, consectetur adipiscing elit, SED DO eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, TOTAM REP aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo. Nemo enim ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. NEQUE porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem. UT ENIM AD minima veniam, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur? Quis autem vel eum iure reprehenderit qui in ea voluptate velit esse quam nihil molestiae consequatur, vel illum qui dolorem eum fugiat quo voluptas nulla pariatur?

should become:

LORUM ipsum dolor sit amet, consectetur adipiscing elit, 

SED DO eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, 

TOTAM REP aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo. Nemo enim ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. NEQUE porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem. 

UT ENIM AD minima veniam, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur? Quis autem vel eum iure reprehenderit qui in ea voluptate velit esse quam nihil molestiae consequatur, vel illum qui dolorem eum fugiat quo voluptas nulla pariatur?

I've tried searching for \b[A-Z](.*?)+\b which returns the uppercase words, and I've tried \b[A-Z](.*?)+\b(.*?)\b[A-Z](.*?)+\b which comes close for a couple of documents but fails on others including the Lorem Ipsum example.

Upvotes: 2

Views: 79

Answers (4)

Mosab Sasi

Reputation: 1130

Try searching for this regex: (\s)(([A-Z]+\s\b)+)

and replace it with \n\2, or with \n\n\2 to leave a blank line in between.
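
The search-and-replace above could be tried out roughly as follows. This is only a sketch in Python, since the answer does not name a language or editor; the helper name `break_before_caps` and the sample text are made up for illustration. Note that this pattern also fires on a single all-caps word, and that it consumes the whitespace character in front of the match.

```python
import re

# Pattern from the answer: one whitespace character, then one or more
# all-caps words that are each followed by whitespace and a word boundary.
CAPS_RUN = re.compile(r"(\s)(([A-Z]+\s\b)+)")

def break_before_caps(text: str, blank_line: bool = False) -> str:
    """Hypothetical helper: swap the whitespace before each all-caps run for a newline."""
    newline = "\n\n" if blank_line else "\n"
    # A function replacement keeps the newline literal and reuses group 2,
    # which is what the \2 backreference refers to in the answer.
    return CAPS_RUN.sub(lambda m: newline + m.group(2), text)

sample = "elit, SED DO eiusmod. Ut enim, TOTAM REP aperiam"
print(break_before_caps(sample))
```

Passing `blank_line=True` corresponds to the \n\n\2 variant of the replacement.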

Upvotes: 0

Wiktor Stribiżew

Reputation: 626802

One matching approach is to match consecutive whitespace-separated ALLCAPS words and then any characters that do not start a sequence of 2 uppercase letters:

\b[A-Z]+(?:\s+[A-Z]+)*(?:(?![A-Z]{2}).)*

See the regex demo

If the ALLCAPS words must consist of at least 2 letters, use limiting quantifiers instead of +:

\b[A-Z]{2,}(?:\s+[A-Z]{2,})*(?:(?![A-Z]{2}).)*
       ^^^^           ^^^^

Pattern details:

  • \b - a leading word boundary
  • [A-Z]+ - 1 or more uppercase ASCII letters
  • (?:\s+[A-Z]+)* - zero or more sequences of:
    • \s+ - 1+ whitespaces
    • [A-Z]+ - 1+ uppercase ASCII letters
  • (?:(?![A-Z]{2}).)* - a tempered greedy token matching any char that is not starting a sequence of 2 uppercase ASCII letters.
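
As an illustration of how the matches might be collected, here is a minimal sketch in Python using the {2,} variant of the pattern. The function name `split_chunks` and the sample text are assumptions made for this example, not part of the answer. Because `.` does not match a newline by default, text spanning several lines would need the DOTALL flag, and any text before the first all-caps word is not part of any match.

```python
import re

# The {2,} variant from the answer: a run of all-caps words (2+ letters each),
# then every character that does not start a pair of uppercase ASCII letters.
CHUNK = re.compile(r"\b[A-Z]{2,}(?:\s+[A-Z]{2,})*(?:(?![A-Z]{2}).)*")

def split_chunks(text: str) -> list[str]:
    """Hypothetical helper: return each chunk with surrounding whitespace trimmed."""
    return [m.group(0).strip() for m in CHUNK.finditer(text)]

sample = "LORUM ipsum elit, SED DO eiusmod. Ut enim, TOTAM REP aperiam."
# Join the chunks with a blank line between them
print("\n\n".join(split_chunks(sample)))
```

With this pattern a lone all-caps word such as NEQUE also starts a new chunk; if only runs of two or more words should start one, the `*` after the first group would have to become `+`.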

Upvotes: 2

AbraCadaver

Reputation: 78994

preg_split() will get part of the way:

// -1 means "no limit"; passing null here is deprecated as of PHP 8.1
$result = preg_split('/([A-Z][A-Z ]+)/',
                     $string,
                     -1,
                     PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
  • Split on an uppercase letter followed by more uppercase letters or spaces [A-Z][A-Z ]+
  • Capture the match () as well with PREG_SPLIT_DELIM_CAPTURE

Then, unless someone has a better way within the preg_split():

$result = array_map(function($v) {
                        return implode(' ', $v);
                    },
                    array_chunk($result, 2));
  • Chunk the array into pairs of the uppercase match and what comes after
  • Implode the pairs

Then if you want it back to a string with newlines:

$result = implode("\n", $result);

Upvotes: 1

DrRoach

Reputation: 1356

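This regular expression is a starting point: [A-Z]\w+. It matches an uppercase letter [A-Z] followed by one or more word characters \w+. Note that it therefore also matches capitalized words such as Duis; to select only all-caps words, something stricter like \b[A-Z]{2,}\b is needed.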

Upvotes: -1
