Reputation: 225
text = "This is a TEXT CONTAINING UPPER CASE WORDS and lower case words. This is a SECOND SENTENCE."
pattern = '[A-Z]+[A-Z]+[A-Z]*[\s]+'
re.findall(pattern, text)
gives this output:
['TEXT ', 'CONTAINING ', 'UPPER ', 'CASE ', 'WORDS ', 'SECOND ', 'SENTENCE ']
However, I want each run of consecutive uppercase words as a single string, like this:
['TEXT CONTAINING UPPER CASE WORDS', 'SECOND SENTENCE']
Upvotes: 4
Views: 8093
Reputation: 785176
You may use this regex:
\b[A-Z]+(?:\s+[A-Z]+)*\b
RegEx Details:
\b: Word boundary
[A-Z]+: Match a word comprising only uppercase letters
(?:\s+[A-Z]+)*: Match 1+ whitespace characters followed by another word of uppercase letters. Match this group 0 or more times
\b: Word boundary
Code:
>>> import re
>>> s = 'This is a TEXT CONTAINING UPPER CASE WORDS and lower case words. This is a SECOND SENTENCE'
>>> print(re.findall(r'\b[A-Z]+(?:\s+[A-Z]+)*\b', s))
['TEXT CONTAINING UPPER CASE WORDS', 'SECOND SENTENCE']
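The breakdown above can also be kept next to the pattern itself by compiling it with re.VERBOSE. This is only a sketch: the names UPPER_RUN and find_upper_runs are made up for illustration, and the pattern is the one from this answer, unchanged.

```python
import re

# Hypothetical name; same pattern as above, spelled out with re.VERBOSE
# so that each part carries its own comment.
UPPER_RUN = re.compile(
    r"""
    \b              # word boundary
    [A-Z]+          # a word made only of uppercase letters
    (?:\s+[A-Z]+)*  # zero or more times: whitespace, then another uppercase word
    \b              # word boundary
    """,
    re.VERBOSE,
)


def find_upper_runs(text):
    """Return each run of consecutive all-uppercase words as one string."""
    return UPPER_RUN.findall(text)


text = ("This is a TEXT CONTAINING UPPER CASE WORDS and lower case words. "
        "This is a SECOND SENTENCE.")
print(find_upper_runs(text))
# ['TEXT CONTAINING UPPER CASE WORDS', 'SECOND SENTENCE']
```

Note that [A-Z]+ also accepts a single letter, so standalone words such as "I" or "A" count as uppercase words: find_upper_runs("I saw A CAT") returns ['I', 'A CAT'].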
Upvotes: 14
Reputation: 54148
To improve the regex: you want at least 2 uppercase letters, so use the dedicated quantifier {2,} (2 or more), and use word boundaries to be sure to match the whole word:
pattern = r'\b[A-Z]{2,}\b'
Then do the job sentence by sentence: find the sentences with a basic regex, look for the uppercase words in each sentence, and save them in a list after joining them with a space.
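import re

result = []
# Split into sentences; the dot is escaped so it only matches a literal period
sentences = re.findall(r"[^.]+\.?", text)
for sentence in sentences:
    # All uppercase words in this sentence, joined into one string
    uppercase = re.findall(pattern, sentence)
    result.append(" ".join(uppercase))
print(result)  # ['TEXT CONTAINING UPPER CASE WORDS', 'SECOND SENTENCE']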
As a list comprehension, it looks like this:
res = [" ".join(re.findall(pattern, sentence)) for sentence in re.findall(r"[^.]+\.?", text)]
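The per-sentence join and a run-based pattern give the same result on the question's text, but they differ when the uppercase words in a sentence are not adjacent: the join merges them into one string, while a run-based pattern keeps them apart. Here is a minimal sketch of both, combining the {2,} quantifier with a non-capturing group. The function names and the sample sentence are illustrative and not from the thread, and skipping sentences without uppercase words is an added assumption.

```python
import re

WORD = r"\b[A-Z]{2,}\b"                  # one word of 2+ uppercase letters
RUN = r"\b[A-Z]{2,}(?:\s+[A-Z]{2,})*\b"  # consecutive words of that kind


def join_per_sentence(text):
    """Join every uppercase word found in a sentence into one string."""
    result = []
    for sentence in re.findall(r"[^.]+", text):
        words = re.findall(WORD, sentence)
        if words:  # assumption: sentences with no uppercase words are skipped
            result.append(" ".join(words))
    return result


def contiguous_runs(text):
    """Return only runs of uppercase words that are actually adjacent."""
    return re.findall(RUN, text)


sample = "The NASA probe left EARTH. It was FAST AND QUIET."
print(join_per_sentence(sample))  # ['NASA EARTH', 'FAST AND QUIET']
print(contiguous_runs(sample))    # ['NASA', 'EARTH', 'FAST AND QUIET']
```

Which behaviour is wanted depends on whether "NASA" and "EARTH" in the first sample sentence should count as one phrase or two.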
Upvotes: 1