Reputation: 2717
I have unsplit words such as PageMetadataServiceConsumer, PowerSellerUpdateConsumerApplication, MetaDataDomain, etc. These are words with no punctuation or verbs in them, but when we look at them, we can tell what they are made up of.
Is there a way to split PowerSellerUpdateConsumerApplication into Power, Seller, Update, Consumer, Application using nltk?
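For illustration, this is the kind of result I'm after, sketched with a plain regular expression (split_identifier is just a name I made up; I haven't found anything built into nltk that does this directly):
import re

def split_identifier(name):
    # Sketch: take each capital letter together with the lowercase letters that follow it.
    return re.findall(r'[A-Z][a-z]*', name)

print(split_identifier('PowerSellerUpdateConsumerApplication'))
# expected: ['Power', 'Seller', 'Update', 'Consumer', 'Application']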
Upvotes: 0
Views: 44
Reputation: 2327
import re
s = 'PageMetadataServiceConsumer, PowerSellerUpdateConsumerApplication, MetaDataDomain'
# Match each capitalized word except the last one before a word boundary.
reg = r'[A-Z](?![a-z]*\b)[a-z]+'
# Replace every match with itself (\g<0>) plus a trailing space.
a = re.sub(reg, r'\g<0> ', s)
print(a)
OUTPUT
Page Metadata Service Consumer, Power Seller Update Consumer Application, Meta Data Domain
Explanation
[A-Z]       # first char: a capital letter
(?!         # START negative lookahead: do not match if the capital is followed by...
[a-z]*\b    # ...lowercase letters running into a word boundary (i.e. it is the last word)
)           # END negative lookahead
[a-z]+      # then consume the remaining lowercase chars
a = re.sub(reg, r'\g<0> ', s)  # replace each match with itself (\g<0>) followed by a space
If you just want the words, then use the snippet below:
reg = r'[A-Z]+[a-z]+'
for a in re.findall(reg, s):
    print(a)
OUTPUT
Page
Metadata
Service
Consumer
Power
Seller
Update
Consumer
Application
Meta
Data
Domain
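If you want the words grouped per identifier instead of one flat stream, something like the following should work, reusing s and the import from above (the variable name is just illustrative):
words_per_identifier = [re.findall(r'[A-Z]+[a-z]+', part) for part in s.split(', ')]
print(words_per_identifier)
# should give [['Page', 'Metadata', 'Service', 'Consumer'], ['Power', 'Seller', 'Update', 'Consumer', 'Application'], ['Meta', 'Data', 'Domain']]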
Upvotes: 0
Reputation: 10466
You may try the following approach:
The idea is to insert a splitter string (### in the code below) to the left of every run of uppercase characters and then split on that string. If you think ### might itself appear in the input, use anything like ~!@*@&$@#! or whatever you are sure will never occur in the string.
import re
regex = r"([A-Z]+)"
test_str = "agePowerSellerUpdateConsumerApplicationMetaDataDomainageMetadataServiceConsumerBBc"
subst = "###\\1"  # prepend ### to each captured run of capitals
result = re.sub(regex, subst, test_str, 0)  # count=0 means replace all occurrences
if result:
    print(re.split("###", result))
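One thing to watch: if the input starts with an uppercase letter, the ### ends up at the very beginning, so re.split returns a leading empty string. A small self-contained sketch that filters it out:
import re

test_str = "PowerSellerUpdateConsumerApplication"
result = re.sub(r"([A-Z]+)", r"###\1", test_str)
print(re.split("###", result))                    # ['', 'Power', 'Seller', 'Update', 'Consumer', 'Application']
print([p for p in re.split("###", result) if p])  # drops the leading empty string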
Upvotes: 1