Reputation: 169
I'm trying to split a string of words into a list of words via regex. I'm still a bit of a beginner with regular expressions.
I'm using nltk.regexp_tokenize, which yields results that are close, but not quite what I want.
This is what I have so far:
>>> import re, codecs, nltk
>>> sentence = "détesté Rochard ! m'étais à... 'C'est hyper-cool.' :) :P"
>>> pattern = r"""(?x)
#words with internal hyphens
| \w+(-\w+)*
#ellipsis
| \.\.\.
#other punctuation tokens
| [][.,;!?"'():-_`]
"""
>>> nltk.regexp_tokenize(sentence.decode("utf8"), pattern)
[u'd\xe9test\xe9', u'Rochard', u'!', u'm', u"'", u'\xe9tais', u'\xe0', u'qu', u"'", u'on', u'...', u"'", u'C', u"'", u'est', u'hyper-cool', u'.', u"'", u':', u')', u':', u'P']
I would like to have the output as follows:
[u'd\xe9test\xe9', u'Rochard', u'!', u"m'", u'\xe9tais', u'\xe0', u"qu'", u'on', u'...', u"'", u"C'", u'est', u'hyper-cool', u'.', u"'", u':)', u':P']
I have a workaround for the "emoticons", so what I'm most concerned with are quotes.
Upvotes: 0
Views: 579
Reputation: 63737
It seems that the desired output is not consistent with your input sentence:
- [u"qu'", u'on']: I can't figure out where these two matches came from in your sentence.
- u'.' was not part of u'hyper-cool' (assuming you want the punctuation as part of the word).
- u"'" was not part of u"C'" (assuming you want the punctuation as part of the word).
Also, if you just want a regex split, is there any reason why you are using nltk apart from splitting the lines? I have no experience with nltk, so I am proposing just a regex solution.
>>> sentence
u"d\xe9test\xe9 Rochard ! m'\xe9tais \xe0... 'C'est hyper-cool.' :) :P"
>>> pattern=re.compile(
u"(" #Capturing group
"(?:" #Non-capturing group
"[\.\.\.\]\[\.,;\!\?\"\'\(\):-_`]?" #0-1 punctuation
"[\w\-]+" #Alphanumeric Unicode word with hyphen
"[\.\.\.\]\[\.,;\!\?\"\'\(\):-_`]?" #0-1 punctuation
")"
"|(?:[\.\.\.\]\[\.,;\!\?\"\'\(\):-_`]+)" #1 or more punctuation
")",re.UNICODE)
>>> pattern.findall(sentence)
[u'd\xe9test\xe9', u'Rochard', u'!', u"m'", u'\xe9tais', u'\xe0.', u'..', u"'C'", u'est', u'hyper-cool.', u"'", u':)', u':P']
See if this works for you.
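One thing to watch for in both patterns above: inside a character class, `:-_` is read as a range from `:` to `_`, and that range happens to include the uppercase letters. The trailing optional punctuation is also what splits `à...` into `à.` and `..`. Below is a minimal Python 3 sketch (not the Python 2 session above) that shows the range issue and then tries one possible pattern for the apostrophe case. The names `loose`, `strict` and `token_re` are made up for illustration, and the emoticon alternative is only a placeholder assumption, since the question already has its own workaround.

```python
import re

sentence = "détesté Rochard ! m'étais à... 'C'est hyper-cool.' :) :P"

# ":-_" inside [...] is a range from ":" to "_", which covers A-Z.
loose = re.compile(r"""[.,;!?"'():-_`]""")
# Putting the hyphen last makes it a literal hyphen instead.
strict = re.compile(r"""[.,;!?"'():_`-]""")

print(bool(loose.fullmatch("R")))   # True: "R" falls inside the range
print(bool(strict.fullmatch("R")))  # False

# Hypothetical tokenizer: alternatives are tried left to right,
# so the elision rule comes before the plain word rule.
token_re = re.compile(r"""
    \w+'                   # word that keeps a trailing apostrophe: m', C'
  | \w+(?:-\w+)*           # words, optionally with internal hyphens
  | \.\.\.                 # ellipsis
  | :[)(P]                 # placeholder emoticons (assumption)
  | [\[\].,;!?"'():_`-]    # single punctuation tokens
""", re.VERBOSE)

tokens = token_re.findall(sentence)
print(tokens)
# ['détesté', 'Rochard', '!', "m'", 'étais', 'à', '...', "'", "C'",
#  'est', 'hyper-cool', '.', "'", ':)', ':P']
```

Note that the first alternative also grabs a word followed directly by a closing quote (for example `cool'` in `'cool'`), so it would need a list of known elided forms or a lookahead if that case matters for your data.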
If you need more information on capturing groups, non-capturing groups, character classes, Unicode matching and findall, I would suggest taking a look at the documentation for Python's re module. Also, I am not sure whether the way you are continuing the string across multiple lines is appropriate in this scenario. If you need more information on splitting a string across lines (not multi-line strings), please have a look into this.
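To make the group and line-continuation points concrete, here is a small Python 3 sketch using only the re module. The sample text and variable names are made up for illustration. It shows how findall returns the group's content when the pattern has a capturing group, and how adjacent string literals inside parentheses are joined into one pattern string.

```python
import re

text = "hyper-cool well-known word"

# With a capturing group, findall returns what the group captured,
# not the whole match (an empty string if the group did not take part).
capturing = re.findall(r"\w+(-\w+)*", text)
print(capturing)      # ['-cool', '-known', '']

# With a non-capturing group, findall returns the whole match.
non_capturing = re.findall(r"\w+(?:-\w+)*", text)
print(non_capturing)  # ['hyper-cool', 'well-known', 'word']

# Adjacent string literals are concatenated at compile time, so a pattern
# can be split across lines with a comment next to each piece.
pattern = re.compile(
    r"\w+"         # a word
    r"(?:-\w+)*"   # optional hyphenated parts
)
print(pattern.pattern == r"\w+(?:-\w+)*")  # True
```

This is why the pattern in the answer wraps everything in one outer capturing group and keeps the inner groups non-capturing: findall then returns the full token each time.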
Upvotes: 1