Reputation: 1303
I have this string, which is a mix of a title and a regular sentence (nothing separates the two).
text = "Read more: Indonesia to Get Moderna Vaccines Before the pandemic began, a lot of people were...."
The title actually ends at the word Vaccines; Before the pandemic is the start of another sentence, completely separate from the title.
How do I remove the substring up to the word Vaccines? My idea was to remove everything from "Read more:" through the run of capitalized words that follows it, stopping one word before the last capitalized word (Before). But I don't know how to handle a conjunction or preposition that doesn't need to be capitalized in a title, like the word the.
I know there is a string method title() to convert a string into title format in Python, but is there any function that can detect whether a substring is a title?
I have tried the following regular expression:
import re
text = "Read more: Indonesia to Get Moderna Vaccines Before the pandemic began, a lot of people were...."
res = re.sub(r"\s*[A-Z]\s*", " ", text)
res
But it did not do what I wanted: it stripped the capital letters themselves (and the whitespace around them) instead of removing the title.
Upvotes: 1
Views: 686
Reputation: 627082
You can match the title by matching a sequence of capitalized words and words that can be non-capitalized in titles.
^(?:Read\s+more\s*:)?\s*(?:(?:[A-Z]\S*|the|an?|[io]n|at|with(?:out)?|from|for|and|but|n?or|yet|[st]o|around|by|after|along|from|of)\s+)*(?=[A-Z])
See the regex demo.
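A minimal sketch of how the pattern above could be applied with Python's built-in re module. The names TITLE_PATTERN and strip_title are made up for illustration, and the pattern is the one from this answer with the duplicated "from" alternative dropped.

```python
import re

# Pattern from the answer: an optional "Read more:" prefix, then a run of
# capitalized words and lowercase title words, ending right before a capital.
TITLE_PATTERN = re.compile(
    r"^(?:Read\s+more\s*:)?\s*"
    r"(?:(?:[A-Z]\S*|the|an?|[io]n|at|with(?:out)?|from|for|and|but|n?or|yet"
    r"|[st]o|around|by|after|along|of)\s+)*"
    r"(?=[A-Z])"
)


def strip_title(text: str) -> str:
    """Remove the leading title (hypothetical helper, not from the answer)."""
    return TITLE_PATTERN.sub("", text, count=1)


text = "Read more: Indonesia to Get Moderna Vaccines Before the pandemic began, a lot of people were...."
print(strip_title(text))
# Before the pandemic began, a lot of people were....
```

The match ends before "Before" because the regex backtracks to the last capitalized word it can stop in front of. If the sentence after the title contained another capitalized word directly after a run of title-like words (for example "Before the Government ..."), the match would extend further, so this is a heuristic rather than a reliable title detector.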
Details:
^ - start of string
(?:Read\s+more\s*:)? - an optional non-capturing group matching Read, one or more whitespaces, more, zero or more whitespaces and a :
\s* - zero or more whitespaces
(?:(?:[A-Z]\S*|the|an?|[io]n|at|with(?:out)?|from|for|and|but|n?or|yet|[st]o|around|by|after|along|from|of)\s+)* - zero or more sequences of:
  (?:[A-Z]\S*|the|an?|[io]n|at|with(?:out)?|from|for|and|but|n?or|yet|[st]o|around|by|after|along|from|of) - a capitalized word that may contain any non-whitespace chars, or one of the words that can stay non-capitalized in an English title
  \s+ - one or more whitespaces
(?=[A-Z]) - followed by an uppercase letter (a lookahead, so that letter is not consumed).
NOTE: You mentioned your language is not English, so you may want to replace [A-Z] with \p{Lu} to match any Unicode uppercase letter and \S* with \p{L}* to match zero or more Unicode letters, BUT make sure you use the PyPI regex library then, as Python's built-in re does not support the Unicode category classes.
Upvotes: 2
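The Unicode variant mentioned in the NOTE of the regex answer could look roughly like the sketch below. It assumes the third-party regex package is installed (pip install regex); the helper name strip_title_unicode and the sample sentence are invented for illustration, and the list of lowercase words is still the English one from that answer.

```python
try:
    import regex  # third-party package from PyPI, not the built-in re
except ImportError:
    regex = None  # the sketch below only runs when the package is available

# Same structure as the ASCII pattern, with [A-Z] -> \p{Lu} and \S* -> \p{L}*.
UNICODE_TITLE = (
    r"^(?:Read\s+more\s*:)?\s*"
    r"(?:(?:\p{Lu}\p{L}*|the|an?|[io]n|at|with(?:out)?|from|for|and|but|n?or|yet"
    r"|[st]o|around|by|after|along|of)\s+)*"
    r"(?=\p{Lu})"
)


def strip_title_unicode(text):
    """Remove a leading title that may contain non-ASCII capitals (hypothetical helper)."""
    if regex is None:
        raise RuntimeError("the 'regex' package is required for \\p{...} classes")
    return regex.sub(UNICODE_TITLE, "", text, count=1)


# Made-up sample with non-ASCII uppercase letters.
sample = "Read more: Österreich to Get Moderna Vaccines Ärzte in the region said..."
if regex is not None:
    print(strip_title_unicode(sample))
    # Ärzte in the region said...
```

Because \p{L}* only matches letters, a capitalized token that contains digits or punctuation (such as a hyphenated word) would stop the run of title words earlier than \S* would.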
Reputation: 26
Why don't you just use slicing?
title = text[:44]
print(title)
Read more: Indonesia to Get Moderna Vaccines
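The hard-coded 44 only fits this exact string. If the last word of the title is known in advance (an assumption, and the main limitation of slicing), the cut position can be computed instead. The helper split_after below is a hypothetical name used only for this sketch.

```python
def split_after(text, marker):
    """Split text right after the first occurrence of marker.

    Returns (title, rest); if marker is absent, title is empty and rest is text.
    """
    start = text.find(marker)
    if start == -1:
        return "", text
    end = start + len(marker)
    return text[:end], text[end:].lstrip()


text = "Read more: Indonesia to Get Moderna Vaccines Before the pandemic began, a lot of people were...."
title, rest = split_after(text, "Vaccines")
print(title)  # Read more: Indonesia to Get Moderna Vaccines
print(rest)   # Before the pandemic began, a lot of people were....
```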
Upvotes: 0