Split sentence by “words”, treating multiple capital words (assumed to be proper nouns) as one

Question

I am trying to split a string, where multi-word proper nouns are recognized as one token. For example, the following code needs to be changed,

import re

s = 'Multi-Criteria Decision Making (MCDM) is increasingly used in RE projects.'
out = re.compile(r"\s").split(s)

print(out)

in order to get this desired outcome:

['Multi-Criteria Decision Making', 'MCDM', 'is', 'increasingly', 'used', 'in', 'RE', 'projects']

I have found this, but I am not able to incorporate it into the code correctly.

Thanks in advance!

The fourth bird · Accepted Answer

You could match consecutive words starting with an uppercase char followed by 1+ lowercase chars with either a space or - in between to get a single match for Multi-Criteria Decision Making.

To match the other words, you can use an alternation | to match 1 or more word characters.

[A-Z][a-z]+(?:[ -][A-Z][a-z]+)*|\w+
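As a quick check, here is a minimal sketch that runs the pattern above against the sentence from the question. The variable names (`tokens`, `merged`) are only illustrative. The second call shows a caveat worth keeping in mind: any capitalized word, such as a sentence-initial one, is glued to the capitalized words that follow it.

```python
import re

# Pattern from above: a run of capitalized words joined by a space or
# hyphen, otherwise any run of word characters.
pattern = r'[A-Z][a-z]+(?:[ -][A-Z][a-z]+)*|\w+'

s = 'Multi-Criteria Decision Making (MCDM) is increasingly used in RE projects.'
tokens = re.findall(pattern, s)
print(tokens)
# ['Multi-Criteria Decision Making', 'MCDM', 'is', 'increasingly', 'used', 'in', 'RE', 'projects']

# Caveat: "In" and "Germany" are capitalized too, so they join the run.
merged = re.findall(pattern, 'In Germany Decision Making is popular.')
print(merged)
# ['In Germany Decision Making', 'is', 'popular']
```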

If the phrase must be followed by a part with 2 or more uppercase chars between parentheses, you could use a positive lookahead.

Note that the lookahead only checks for the presence of uppercase chars; it does not check that they are the initials of the preceding words.

[A-Z][a-z]+(?:[ -][A-Z][a-z]+)+(?= \([A-Z]{2,}\))|\w+

import re
 
s = 'Multi-Criteria Decision Making (MCDM) is increasingly used in RE projects.'
pattern = r'[A-Z][a-z]+(?:[ -][A-Z][a-z]+)+(?= \([A-Z]{2,}\))|\w+'
print(re.findall(pattern, s))

Output

['Multi-Criteria Decision Making', 'MCDM', 'is', 'increasingly', 'used', 'in', 'RE', 'projects']
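To illustrate the note about the lookahead, here is a small sketch on three made-up test strings (they are illustrative assumptions, not taken from the question). It suggests that capitalized words are only grouped when a parenthesized run of uppercase chars follows, and that the letters inside the parentheses are not compared with the words before them.

```python
import re

# Lookahead variant: at least two capitalized words, and only when they are
# followed by a space and 2+ uppercase chars in parentheses.
pattern = r'[A-Z][a-z]+(?:[ -][A-Z][a-z]+)+(?= \([A-Z]{2,}\))|\w+'

# Acronym present: the capitalized words are kept together.
with_acronym = re.findall(pattern, 'Decision Making (DM) is hard')
print(with_acronym)     # ['Decision Making', 'DM', 'is', 'hard']

# No acronym: the lookahead fails, so \w+ matches each word on its own.
without_acronym = re.findall(pattern, 'Decision Making is hard')
print(without_acronym)  # ['Decision', 'Making', 'is', 'hard']

# Unrelated uppercase letters still satisfy the lookahead.
mismatched = re.findall(pattern, 'Decision Making (XYZ) is hard')
print(mismatched)       # ['Decision Making', 'XYZ', 'is', 'hard']
```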
