martina casati

Reputation: 11

How to split string by a series of uppercase words with regex

Given a text like this:

text= "THE TEXT contains uppercase letter, but ALSO LOWER case ones. This is another sentence."

I want output like this:

['THE TEXT contains uppercase letter, but', 'ALSO LOWER case ones. This is another sentence.']

How can I write a regex to obtain that output?

I tried this regex, "(\b[A-Z][A-Z]+(?:\s+[A-Z][A-Z]+)*\b)", but the output was different:

[ '',
 'THE TEXT',
 'contains uppercase letter, but',
 'ALSO LOWER',
  'case ones. This is another sentence.']

Upvotes: 1

Views: 480

Answers (1)

Wiktor Stribiżew

Reputation: 626893

You can match and extract them with

re.findall(r'\b[A-Z]{2,}(?:\s+[A-Z]{2,})*\b.*?(?=\s*\b[A-Z]{2}|$)', text, re.DOTALL)

See the regex demo.

Details:

  • \b[A-Z]{2,}(?:\s+[A-Z]{2,})*\b - a word boundary, two or more ASCII uppercase letters, then zero or more repetitions of one or more whitespace characters followed by two or more ASCII uppercase letters, and a word boundary
  • .*? - any zero or more characters, as few as possible (with re.DOTALL, the dot also matches line breaks)
  • (?=\s*\b[A-Z]{2}|$) - a positive lookahead that matches a location immediately followed by zero or more whitespace characters, a word boundary and two uppercase letters, or by the end of the string.
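
To make the pieces above concrete, here is a minimal sketch that runs the same pattern on the sample text from the question. The function name split_on_uppercase_runs is only an illustrative wrapper, not part of the re module.

```python
import re

# Same pattern as in the answer: a run of all-caps words, then as little
# text as possible up to the next all-caps run or the end of the string.
PATTERN = re.compile(
    r'\b[A-Z]{2,}(?:\s+[A-Z]{2,})*\b.*?(?=\s*\b[A-Z]{2}|$)',
    re.DOTALL,
)

def split_on_uppercase_runs(text):
    """Return the chunks of text that each start with a run of uppercase words.

    Illustrative helper (hypothetical name); it simply calls findall.
    """
    return PATTERN.findall(text)

text = ("THE TEXT contains uppercase letter, but ALSO LOWER case ones. "
        "This is another sentence.")

print(split_on_uppercase_runs(text))
# ['THE TEXT contains uppercase letter, but',
#  'ALSO LOWER case ones. This is another sentence.']
```

Note that because this extracts matches rather than splitting, any text that comes before the first uppercase run is not returned, and a string with no uppercase run yields an empty list.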

Upvotes: 1
