Reputation: 2717

Tokenizing unsplit words using NLTK/Python3

I have unsplit words such as PageMetadataServiceConsumer, PowerSellerUpdateConsumerApplication, MetaDataDomain, etc. These words don't contain any punctuation or verbs, but when we look at one we can tell which smaller words it is made up of.

Is there a way to split PowerSellerUpdateConsumerApplication into Power, Seller, Update, Consumer, Application using NLTK?

Upvotes: 0

Answers (2)

kaza

Reputation: 2327

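import re

s = 'PageMetadataServiceConsumer, PowerSellerUpdateConsumerApplication, MetaDataDomain'
# Match each capitalized chunk except the last one in a word.
reg = r'[A-Z](?![a-z]*\b)[a-z]+'
# Raw string for the replacement so that \g<0> is not treated as a string escape.
a = re.sub(reg, r'\g<0> ', s)
print(a)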

OUTPUT

Page Metadata Service Consumer, Power Seller Update Consumer Application, Meta Data Domain

Explanation

[A-Z]        # First char: a capital letter
(?!          # START negative lookahead: do not match if the capital is followed by this
[a-z]*\b     # lowercase letters running up to a word boundary \b (i.e. this is the last part of the word)
)            # END negative lookahead
[a-z]+       # Match the remaining lowercase chars of this part


a = re.sub(reg, r'\g<0> ', s)  # Replace each match with itself (\g<0>) followed by a space.
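
To see what the negative lookahead buys you, here is a small comparison sketch. The second pattern, without the lookahead, is a hypothetical variant written only for contrast; it is not part of the answer.

```python
import re

# Pattern from the answer: a capital letter plus its lowercase tail,
# skipped when that tail runs straight into a word boundary.
WITH_LOOKAHEAD = r'[A-Z](?![a-z]*\b)[a-z]+'
# Hypothetical variant without the lookahead, for comparison only.
WITHOUT_LOOKAHEAD = r'[A-Z][a-z]+'

s = 'PowerSellerUpdateConsumerApplication, MetaDataDomain'

with_la = re.sub(WITH_LOOKAHEAD, r'\g<0> ', s)
without_la = re.sub(WITHOUT_LOOKAHEAD, r'\g<0> ', s)

print(repr(with_la))     # 'Power Seller Update Consumer Application, Meta Data Domain'
print(repr(without_la))  # 'Power Seller Update Consumer Application , Meta Data Domain '
```

Without the lookahead, the last part of every word also gets a space appended, which leaves a stray space before the comma and at the end of the string.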


If you just want the words themselves, use the following:

reg = r'[A-Z]+[a-z]+'
for word in re.findall(reg, s):
    print(word)

OUTPUT

Page
Metadata
Service
Consumer
Power
Seller
Update
Consumer
Application
Meta
Data
Domain
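
One caveat with `[A-Z]+[a-z]+`: it drops a leading lowercase chunk and glues an all-caps run to the word that follows it. The sketch below is one possible variant, assuming that an all-caps run followed by a capitalized word should be kept as its own token. Neither that assumption nor the `split_words` helper comes from the question; both are only illustrative.

```python
import re

# Assumption: an all-caps run followed by a capitalized word is an acronym
# (e.g. "HTTP" in "HTTPServer"); the question does not cover this case.
# First alternative: capitals not followed by a lowercase letter.
# Second alternative: an optional capital followed by lowercase letters.
TOKEN = re.compile(r'[A-Z]+(?![a-z])|[A-Z]?[a-z]+')

def split_words(text):
    """Return the word-like chunks found in text (hypothetical helper)."""
    return TOKEN.findall(text)

print(split_words('PowerSellerUpdateConsumerApplication'))
print(split_words('agePowerSeller'))
print(split_words('HTTPServerError'))
```

For the inputs in the question this returns the same words as the pattern above; the difference only shows up with leading lowercase text or acronyms.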

Upvotes: 0

Mustofa Rizwan

Reputation: 10466

You may try the following approach:

The idea is to insert a delimiter string (### in the code below) to the left of each run of uppercase characters, then split on that delimiter. If you think ### might occur in your input, use something like ~!@*@&$@#! instead, or any other string you are sure will never appear in the text.


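import re

# Capture each run of one or more uppercase letters.
regex = r"([A-Z]+)"
test_str = "agePowerSellerUpdateConsumerApplicationMetaDataDomainageMetadataServiceConsumerBBc"
# Put the delimiter in front of every captured run.
subst = r"###\1"
result = re.sub(regex, subst, test_str)

if result:
    # If test_str starts with an uppercase letter, the first list item is ''.
    print(re.split("###", result))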
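
If picking a safe delimiter is a concern, the same split can probably be done without one by splitting on a zero-width position. This is a different technique from the one above, sketched under the assumption of Python 3.7 or later (where `re.split` accepts patterns that match the empty string); `split_on_capitals` is a hypothetical helper name.

```python
import re

# Zero-width split point: just before an uppercase letter that is not
# itself preceded by an uppercase letter, so runs like "BB" stay together.
BOUNDARY = re.compile(r'(?<![A-Z])(?=[A-Z])')

def split_on_capitals(text):
    # Hypothetical helper. Drops the empty piece that re.split produces
    # when text starts with an uppercase letter.
    return [piece for piece in BOUNDARY.split(text) if piece]

print(split_on_capitals('agePowerSellerUpdateBBc'))
print(split_on_capitals('MetaDataDomain'))
```

Because nothing is inserted into the string, there is no delimiter that could collide with the input.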

Upvotes: 1
