How do I remove the substrings that start with capital letters in a Python string?

Question

I have a string that mixes a title with a regular sentence, with no separator between the two.

text = "Read more: Indonesia to Get Moderna Vaccines Before the pandemic began, a lot of people were...."

The title actually ends at the word Vaccines; "Before the pandemic" starts a completely separate sentence.

How do I remove the substring up to and including the word Vaccines? My idea was to remove everything from "Read more:" through the run of capitalized words that follows it, stopping one word before the end of that run (at Before). But I don't know how to handle a conjunction or preposition that doesn't need to be capitalized in a title, like the word the.

I know Python has a title() method that converts a string to title case, but is there any function that can detect whether a substring is a title?

I have tried the following regular expression:

import re
text = "Read more: Indonesia to Get Moderna Vaccines Before the pandemic began, a lot of people were...."
res = re.sub(r"\s*[A-Z]\s*", " ", text)
res

But it just stripped every capital letter (along with the whitespace around it) instead of removing the title.

Wiktor Stribiżew · Accepted Answer

You can match the title by matching a sequence of capitalized words and words that can be non-capitalized in titles.

^(?:Read\s+more\s*:)?\s*(?:(?:[A-Z]\S*|the|an?|[io]n|at|with(?:out)?|from|for|and|but|n?or|yet|[st]o|around|by|after|along|from|of)\s+)*(?=[A-Z])

See the regex demo.

Details:

^ - start of string
(?:Read\s+more\s*:)? - an optional non-capturing group matching Read, one or more whitespaces, more, zero or more whitespaces and a :
\s* - zero or more whitespaces
(?:(?:[A-Z]\S*|the|an?|[io]n|at|with(?:out)?|from|for|and|but|n?or|yet|[st]o|around|by|after|along|from|of)\s+)* - zero or more sequences of
- (?:[A-Z]\S*|the|an?|[io]n|at|with(?:out)?|from|for|and|but|n?or|yet|[st]o|around|by|after|along|from|of) - a capitalized word (an uppercase letter followed by any zero or more non-whitespace chars) or one of the words that can stay non-capitalized in an English title
- \s+ - one or more whitespaces
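(?=[A-Z]) - a positive lookahead that requires the next char to be an uppercase letter. The letter is not consumed, so the capitalized word that starts the following sentence is left in place.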
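
As a minimal sketch of how the pattern above could be applied in Python, the snippet below compiles it with the built-in re module and removes the match with re.sub. The names TITLE_RE and strip_title are made up for this illustration; the pattern itself is copied verbatim from the answer and split across lines only for readability.

```python
import re

# Pattern from the answer, verbatim (adjacent string literals are concatenated).
TITLE_RE = re.compile(
    r"^(?:Read\s+more\s*:)?\s*"
    r"(?:(?:[A-Z]\S*|the|an?|[io]n|at|with(?:out)?|from|for|and|but|n?or|yet"
    r"|[st]o|around|by|after|along|from|of)\s+)*"
    r"(?=[A-Z])"
)


def strip_title(text):
    """Remove the leading title (hypothetical helper, not part of the answer)."""
    # count=1 is only a safeguard: the pattern is anchored with ^ anyway.
    return TITLE_RE.sub("", text, count=1)


text = "Read more: Indonesia to Get Moderna Vaccines Before the pandemic began, a lot of people were...."
print(strip_title(text))
# Before the pandemic began, a lot of people were....
```

Note that the regex backtracks to the last capitalized word in the run, so it assumes the sentence after the title starts with exactly one capitalized word. If that sentence began with two capitalized words in a row (say, "Before Indonesia ..."), the first of them would be removed together with the title.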

NOTE: You mentioned your language is not English, so:

You need to find the list of words in your language that may stay non-capitalized in a title and use them instead of the|an?|[io]n|at|with(?:out)?|from|for|and|but|n?or|yet|[st]o|around|by|after|along|from|of (and adjust the Read\s+more\s*: prefix accordingly).
You might want to replace [A-Z] with \p{Lu} to match any Unicode uppercase letter and \S* with \p{L}* to match any zero or more Unicode letters, BUT then make sure you use the PyPI regex library, as Python's built-in re does not support the Unicode category classes.
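
The two points in the note could be combined along these lines. This is only a sketch under assumptions: build_title_pattern and SMALL_WORDS are hypothetical names, the word list is a short English placeholder that you would replace with the words of your own language, and the fallback to the built-in re module (ASCII capitals only) is an addition for environments where the third-party regex package is not installed.

```python
try:
    import regex as rx  # third-party package: pip install regex
    UPPER, REST = r"\p{Lu}", r"\p{L}*"  # any Unicode uppercase letter / letters
except ImportError:
    import re as rx  # stdlib fallback: no \p{...} support, ASCII capitals only
    UPPER, REST = r"[A-Z]", r"\S*"

# Placeholder list: swap in the words your language leaves lowercase in titles.
SMALL_WORDS = ["the", "a", "an", "in", "on", "at", "to", "of", "for", "and"]


def build_title_pattern(prefix, small_words):
    """Compile a title-matching regex from a prefix regex and a word list."""
    # Escape each word so it is matched literally inside the alternation.
    words = "|".join(rx.escape(w) for w in small_words)
    return rx.compile(
        rf"^(?:{prefix})?\s*(?:(?:{UPPER}{REST}|{words})\s+)*(?={UPPER})"
    )


pattern = build_title_pattern(r"Read\s+more\s*:", SMALL_WORDS)
text = "Read more: Indonesia to Get Moderna Vaccines Before the pandemic began, a lot of people were...."
result = pattern.sub("", text, count=1)
print(result)
# Before the pandemic began, a lot of people were....
```

Keep in mind that \p{L}* is stricter than \S*: a title word containing digits, hyphens or trailing punctuation (for example "COVID-19") would stop the match early, so the choice between the two depends on what your titles look like.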
