Reputation: 79
I am trying to split a string, where multi-word proper nouns are recognized as one token. For example, the following code needs to be changed,
import re
s = 'Multi-Criteria Decision Making (MCDM) is increasingly used in RE projects.'
out = re.compile(r"\s").split(s)
print(out)
in order to get this desired outcome:
['Multi-Criteria Decision Making', 'MCDM', 'is', 'increasingly', 'used', 'in', 'RE', 'projects']
I have found this, but I am not able to incorporate it into my code correctly.
Thanks in advance!
Upvotes: 1
Views: 86
Reputation: 163362
You could match consecutive words starting with an uppercase char followed by 1+ lowercase chars, with either a space or - in between, to get a single match for Multi-Criteria Decision Making.
To match the other words, you can use an alternation | to match 1 or more word characters.
[A-Z][a-z]+(?:[ -][A-Z][a-z]+)*|\w+
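A minimal sketch of how this first pattern behaves. The helper name tokenize_simple and the second sentence are illustrative assumptions, not part of the question:

```python
import re

# First pattern: a run of capitalized words joined by a space or a hyphen,
# or (as a fallback) any run of word characters.
SIMPLE = re.compile(r'[A-Z][a-z]+(?:[ -][A-Z][a-z]+)*|\w+')

def tokenize_simple(text):
    """Return all matches of SIMPLE in text (hypothetical helper)."""
    return SIMPLE.findall(text)

s = 'Multi-Criteria Decision Making (MCDM) is increasingly used in RE projects.'
print(tokenize_simple(s))
# ['Multi-Criteria Decision Making', 'MCDM', 'is', 'increasingly', 'used', 'in', 'RE', 'projects']

# Any adjacent capitalized words are merged, including a sentence-initial word.
print(tokenize_simple('In Berlin we use MCDM'))
# ['In Berlin', 'we', 'use', 'MCDM']
```

Because the group is repeated with *, a single capitalized word also matches the first alternative, and any neighbouring capitalized words are joined into one token, which may or may not be what you want.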
If the words should only be grouped when they are followed by 2 or more uppercase chars in parentheses, you could use a positive lookahead.
Note that the lookahead only checks for the presence of uppercase chars; it does not check that they are the same uppercase chars that start the preceding words.
[A-Z][a-z]+(?:[ -][A-Z][a-z]+)+(?= \([A-Z]{2,}\))|\w+
import re
s = 'Multi-Criteria Decision Making (MCDM) is increasingly used in RE projects.'
pattern = r'[A-Z][a-z]+(?:[ -][A-Z][a-z]+)+(?= \([A-Z]{2,}\))|\w+'
print(re.findall(pattern, s))
Output
['Multi-Criteria Decision Making', 'MCDM', 'is', 'increasingly', 'used', 'in', 'RE', 'projects']
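If you also want to verify that the acronym really consists of the initials of the preceding words, one possible approach is a small post-check in Python. This is only a sketch: the names initials and tokenize_checked, and the fallback of splitting a non-matching phrase into single words, are my own assumptions and not something the question asks for.

```python
import re

PHRASE = r'[A-Z][a-z]+(?:[ -][A-Z][a-z]+)+'
TOKEN = re.compile(PHRASE + r'(?= \([A-Z]{2,}\))|\w+')
ACRONYM = re.compile(r' \(([A-Z]{2,})\)')

def initials(phrase):
    """First letter of every word, splitting on space or hyphen."""
    return ''.join(word[0] for word in re.split(r'[ -]', phrase))

def tokenize_checked(text):
    """Keep a phrase as one token only if its initials equal the acronym."""
    tokens = []
    for m in TOKEN.finditer(text):
        token = m.group(0)
        # A \w+ match never contains a space or hyphen, so this marks a phrase.
        if re.search(r'[ -]', token):
            # The lookahead guarantees that an acronym follows the phrase.
            acronym = ACRONYM.match(text, m.end()).group(1)
            if initials(token) != acronym:
                # Assumed fallback: treat the words as separate tokens.
                tokens.extend(re.findall(r'\w+', token))
                continue
        tokens.append(token)
    return tokens

print(tokenize_checked('Multi-Criteria Decision Making (MCDM) is used'))
# ['Multi-Criteria Decision Making', 'MCDM', 'is', 'used']
print(tokenize_checked('Multi-Criteria Decision Making (XYZ) is used'))
# ['Multi', 'Criteria', 'Decision', 'Making', 'XYZ', 'is', 'used']
```

With the plain pattern, the second sentence would still give 'Multi-Criteria Decision Making' as a single token, because the lookahead accepts any 2 or more uppercase chars.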
Upvotes: 1