ktflghm

Reputation: 169

Splitting a string after punctuation while including punctuation

I'm trying to split a string of words into a list of words via regex. I'm still a bit of a beginner with regular expressions.

I'm using nltk.regexp_tokenize, which yields results that are close to, but not quite, what I want.

This is what I have so far:

>>> import re, codecs, nltk
>>> sentence = "détesté Rochard ! m'étais à... 'C'est hyper-cool.' :) :P"    
>>> pattern = r"""(?x)
    #words with internal hyphens
    \w+(?:-\w+)*
    #ellipsis
    | \.\.\.
    #other punctuation tokens (hyphen escaped so it is not read as a range)
    | [][.,;!?"'():\-_`]
    """
>>> nltk.regexp_tokenize(sentence.decode("utf8"), pattern)
[u'd\xe9test\xe9', u'Rochard', u'!', u'm', u"'", u'\xe9tais', u'\xe0', u'qu', u"'", u'on', u'...', u"'", u'C', u"'", u'est', u'hyper-cool', u'.', u"'", u':', u')', u':', u'P']

I would like to have the output as follows:

[u'd\xe9test\xe9', u'Rochard', u'!', u"m'", u'\xe9tais', u'\xe0', u"qu'", u'on', u'...', u"'", u"C'", u'est', u'hyper-cool', u'.', u"'", u':)', u':P']

I have a workaround for the "emoticons", so what I'm most concerned with are quotes.

Upvotes: 0

Views: 579

Answers (1)

Abhijit

Reputation: 63737

It seems that the desired output is not consistent with your input sentence:

  1. [u"qu'", u'on']: I can't tell where these two tokens come from; neither appears in your sentence.
  2. Why is u'.' not part of u'hyper-cool'? (Assuming you want the punctuation kept as part of the word.)
  3. Why is u"'" not part of u"C'"? (Assuming you want the punctuation kept as part of the word.)

Also, if you just want a regex split, is there any reason to use nltk apart from splitting the lines? I have no experience with nltk, so I am proposing a plain regex solution.

>>> sentence
u"d\xe9test\xe9 Rochard ! m'\xe9tais \xe0... 'C'est hyper-cool.' :) :P"
>>> pattern=re.compile(
    u"(" #Capturing Group
    "(?:" #Non Capturing
    "[\]\[.,;!?\"'():\-_`]?" #0-1 punctuation (hyphen escaped so ':-_' is not a range)
    "[\w\-]+"                #Alphanumeric Unicode word with hyphen
    "[\]\[.,;!?\"'():\-_`]?" #0-1 punctuation
    ")"
    "|(?:[\]\[.,;!?\"'():\-_`]+)" #1 or more punctuation
     ")",re.UNICODE)
>>> pattern.findall(sentence)
[u'd\xe9test\xe9', u'Rochard', u'!', u"m'", u'\xe9tais', u'\xe0.', u'..', u"'C'", u'est', u'hyper-cool.', u"'", u':)', u':P']

See if this works for you
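
If the goal is the output listed in the question (apostrophe kept on the elided word, everything else split off), a different approach may fit better. The following is only a sketch under some assumptions: it targets Python 3 (where str is Unicode, so no u'' prefixes or re.UNICODE are needed), the names TOKEN and tokenize are made up for illustration, and the emoticon alternative covers just a handful of cases.

```python
import re

# Assumption: Python 3, where \w already matches accented letters in str.
TOKEN = re.compile(r"""
    \w+'(?=\w)              # elided word keeps its apostrophe: m', C'
  | \w+(?:-\w+)*            # words with internal hyphens
  | \.\.\.                  # ellipsis
  | [:;]-?[()DPp]           # a few simple emoticons (illustrative only)
  | [\]\[.,;!?"'():\-_`]    # any other single punctuation character
""", re.VERBOSE)

def tokenize(text):
    """Return the tokens of `text` in order, skipping whitespace."""
    return TOKEN.findall(text)

sentence = "détesté Rochard ! m'étais à... 'C'est hyper-cool.' :) :P"
print(tokenize(sentence))
# ['détesté', 'Rochard', '!', "m'", 'étais', 'à', '...', "'", "C'", 'est',
#  'hyper-cool', '.', "'", ':)', ':P']
```

The lookahead (?=\w) is what separates an elision such as m'étais from a closing quote such as cool.' followed by a space: the apostrophe is only attached when a word character comes right after it. The hyphen in the last character class is escaped because an unescaped :-_ would be read as a range from ':' to '_', which also covers the uppercase letters.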

If you need more information on capturing groups, non-capturing groups, character classes, Unicode matching and findall, I would suggest taking a look at the documentation of Python's re module. Also, I am not sure whether the way you continue the string across multiple lines is appropriate in this scenario. If you need more information on splitting a string across lines (not multi-line strings), please have a look at this.
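
To make two of those points concrete, here is a small Python 3 sketch (the sample text and variable names are made up for illustration). It shows how findall behaves with a capturing group versus a non-capturing one, and how adjacent string literals let a pattern be split across lines without a triple-quoted string.

```python
import re

text = "hyper-cool ab-cd"

# With a capturing group, findall returns only what the group captured
# (here the last repetition of the group), not the whole match.
capturing = re.findall(r"\w+(-\w+)*", text)

# With a non-capturing group, findall returns the whole match.
non_capturing = re.findall(r"\w+(?:-\w+)*", text)

# Adjacent string literals inside parentheses are joined into one string,
# so each piece of the pattern can carry its own comment.
split_pattern = re.compile(
    r"\w+"          # a word
    r"(?:-\w+)*"    # optional hyphenated continuations
)

print(capturing)              # ['-cool', '-cd']
print(non_capturing)          # ['hyper-cool', 'ab-cd']
print(split_pattern.pattern)  # \w+(?:-\w+)*
```

This is why the pattern above wraps everything in one outer capturing group and keeps the inner groups non-capturing: findall then returns one complete token per match.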

Upvotes: 1
