catris25

Reputation: 1303

How do I remove the substrings that start with capital letters in a Python string?

I have this string, which is a mix of a title and a regular sentence (there is no separator between the two).

text = "Read more: Indonesia to Get Moderna Vaccines Before the pandemic began, a lot of people were...."

The title actually ends at the word Vaccines; "Before the pandemic" starts another sentence that is completely separate from the title.

How do I remove the substring up to and including the word Vaccines? My idea was to remove everything from "Read more:" through the following run of capitalized words, stopping one word short of the end of that run (at Before). But I don't know how to handle a conjunction or preposition that doesn't need to be capitalized in a title, like the word the.

I know there is a function title() to convert a string into a title format in Python, but is there any function that can detect if a substring is a title?

I have tried the following using a regular expression.

import re
text = "Read more: Indonesia to Get Moderna Vaccines Before the pandemic began, a lot of people were...."
res = re.sub(r"\s*[A-Z]\s*", " ", text)
res

But it just removed the capital letters themselves (together with any whitespace around them) instead of the title.

Upvotes: 1

Views: 686

Answers (2)

Wiktor Stribiżew

Reputation: 627082

You can match the title by matching a sequence of capitalized words and words that can be non-capitalized in titles.

^(?:Read\s+more\s*:)?\s*(?:(?:[A-Z]\S*|the|an?|[io]n|at|with(?:out)?|from|for|and|but|n?or|yet|[st]o|around|by|after|along|from|of)\s+)*(?=[A-Z])

See the regex demo.

Details:

  • ^ - start of string
  • (?:Read\s+more\s*:)? - an optional non-capturing group matching Read, one or more whitespaces, more, zero or more whitespaces and a :
  • \s* - zero or more whitespaces
  • (?:(?:[A-Z]\S*|the|an?|[io]n|at|with(?:out)?|from|for|and|but|n?or|yet|[st]o|around|by|after|along|from|of)\s+)* - zero or more sequences of
    • (?:[A-Z]\S*|the|an?|[io]n|at|with(?:out)?|from|for|and|but|n?or|yet|[st]o|around|by|after|along|from|of) - a capitalized word that may contain any non-whitespace chars, or one of the words that can stay non-capitalized in an English title
    • \s+ - one or more whitespaces
  • (?=[A-Z]) - a positive lookahead that requires the next char to be an uppercase letter.
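
As a minimal sketch of how the pattern above could be applied in Python with the built-in re module (the names TITLE_RE and remove_title are only illustrative, not part of the answer):

```python
import re

# The pattern from the answer, split across lines for readability.
TITLE_RE = re.compile(
    r"^(?:Read\s+more\s*:)?\s*"
    r"(?:(?:[A-Z]\S*|the|an?|[io]n|at|with(?:out)?|from|for|and|but|n?or|yet"
    r"|[st]o|around|by|after|along|from|of)\s+)*"
    r"(?=[A-Z])"
)

def remove_title(text):
    """Strip the leading title-like run; the text is returned as-is if nothing matches."""
    return TITLE_RE.sub("", text, count=1)

text = "Read more: Indonesia to Get Moderna Vaccines Before the pandemic began, a lot of people were...."
print(remove_title(text))
# Before the pandemic began, a lot of people were....
```

Because of the lookahead, the match backtracks to just before the last capitalized word in the run and leaves it in place. A sentence that opens with two capitalized words in a row would therefore lose the first one.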

NOTE: You mentioned your language is not English, so

  1. You need to find the list of words in your language that may stay non-capitalized in a title and use them instead of the|an?|[io]n|at|with(?:out)?|from|for|and|but|n?or|yet|[st]o|around|by|after|along|from|of
  2. You might want to replace [A-Z] with \p{Lu} to match any Unicode uppercase letter and \S* with \p{L}* to match zero or more Unicode letters, BUT make sure you then use the PyPI regex library, as Python's built-in re does not support Unicode category classes.
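
If installing the regex package is not an option, the same idea can be approximated without any regular expression. This is a different technique from the pattern above, sketched under a few assumptions: it splits on whitespace and relies on str.isupper(), which recognizes Unicode uppercase letters. The names strip_title and SMALL_WORDS and the prefix parameter are hypothetical. Unlike the regex, it normalizes whitespace and always drops the prefix when present.

```python
# English words that may stay lowercase in a title; swap in your own language's list.
SMALL_WORDS = {
    "the", "a", "an", "in", "on", "at", "with", "without", "from", "for", "and",
    "but", "or", "nor", "yet", "so", "to", "around", "by", "after", "along", "of",
}

def strip_title(text, prefix="Read more:", small_words=SMALL_WORDS):
    """Drop a leading title-like run of tokens; whitespace is collapsed to single spaces."""
    text = text.lstrip()
    if text.startswith(prefix):
        text = text[len(prefix):]
    tokens = text.split()
    if not tokens:
        return ""
    # Count the leading tokens that could belong to a title.
    n = 0
    while n < len(tokens) and (tokens[n][0].isupper() or tokens[n] in small_words):
        n += 1
    # Back off to the last token in that run that starts with an uppercase letter;
    # that token is assumed to open the regular sentence.
    cut = min(n, len(tokens) - 1)
    while cut > 0 and not tokens[cut][0].isupper():
        cut -= 1
    return " ".join(tokens[cut:])

text = "Read more: Indonesia to Get Moderna Vaccines Before the pandemic began, a lot of people were...."
print(strip_title(text))
```

A made-up non-English input such as strip_title("Élections à Paris Après la pluie, tout va mieux", small_words={"à", "la"}) is handled the same way, because "É".isupper() is true.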

Upvotes: 2

Patrick

Reputation: 26

Why don't you just use slicing?

title = text[:44]
print(title)

Read more: Indonesia to Get Moderna Vaccines
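
The slice above keeps the title, whereas the question asks for it to be removed. A minimal sketch of the complementary slice, assuming the title text is already known in advance (which is the same assumption the hard-coded 44 makes); the variable names are illustrative:

```python
text = "Read more: Indonesia to Get Moderna Vaccines Before the pandemic began, a lot of people were...."
title = "Read more: Indonesia to Get Moderna Vaccines"

cut = len(title)            # 44 for this particular title
body = text[cut:].lstrip()  # everything after the title, without the leading space
print(body)
# Before the pandemic began, a lot of people were....
```

This only works when the title length is known beforehand, so it does not generalize to other strings.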

Upvotes: 0
