Reputation: 16271
I have to match multiple occurrences of tokens in a document and get the value and the position of the matched token.
For non-Unicode text I'm using the regex r"\b(?=\w)" + re.escape(word) + r"\b(?!\w)" with finditer, and it works.
For Unicode text I must use a word-boundary-like solution such as u"(\s|^)%s(\s|$)" % word. This works in most cases, but not when there are two consecutive identical words, as in "तुम मुझे दोस्त कहते कहते हो".
This is the code to reproduce this issue.
import re
import json

# an input document of sentences
document = "These are oranges and apples and and pears, but not pinapples\nThese are oranges and apples and pears, but not pinapples"
# uncomment to test UNICODE
document = "तुम मुझे दोस्त कहते कहते हो"

sentences = []  # sentences
seen = {}       # map of tokens that have been seen already

# split into sentences
lines = document.splitlines()
for index, line in enumerate(lines):
    print("Line:%d %s" % (index, line))
    # split tokens that are words
    # LP: (for Simon ;P we do not care about punctuation at all!)
    rgx = re.compile("([\w][\w']*\w)")
    tokens = rgx.findall(line)
    # uncomment to test UNICODE
    tokens = ["तुम", "मुझे", "दोस्त", "कहते", "कहते", "हो"]
    print("Tokens:", tokens)
    sentence = {}  # a sentence
    items = []     # word tokens
    # for each token word
    for index_word, word in enumerate(tokens):
        # uncomment to test UNICODE
        my_regex = u"(\s|^)%s(\s|$)" % word
        #my_regex = r"\b(?=\w)" + re.escape(word) + r"\b(?!\w)"
        r = re.compile(my_regex, flags=re.I | re.X | re.UNICODE)
        item = {}
        # for each matched token in the sentence
        for m in r.finditer(document):
            token = m.group()
            characterOffsetBegin = m.start()
            characterOffsetEnd = characterOffsetBegin + len(m.group()) - 1  # LP: start from 0
            print("word:%s characterOffsetBegin:%d characterOffsetEnd:%d" % (token, characterOffsetBegin, characterOffsetEnd))
            found = -1
            if word in seen:
                found = seen[word]
            if characterOffsetBegin > found:
                # store where the word has last been seen
                seen[word] = characterOffsetBegin
                item['index'] = index_word + 1  # word index starts from 1
                item['word'] = token
                item['characterOffsetBegin'] = characterOffsetBegin
                item['characterOffsetEnd'] = characterOffsetEnd
                items.append(item)
                break
    sentence['text'] = line
    sentence['tokens'] = items
    sentences.append(sentence)

print(json.dumps(sentences, indent=4, sort_keys=True))
print("------ testing ------")
text = ''
for sentence in sentences:
    for token in sentence['tokens']:
        # LP: we get the token from a slice of the original text
        text = text + document[token['characterOffsetBegin']:token['characterOffsetEnd'] + 1] + " "
    text = text + '\n'
print(text)
Specifically, for the token कहते I get the same match twice instead of the next occurrence:
word: कहते characterOffsetBegin:20 characterOffsetEnd:25
word: कहते characterOffsetBegin:20 characterOffsetEnd:25
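A minimal sketch that isolates just the finditer call, reduced from the code above (variable names here are only for illustration, and the extra flags are trimmed):

import re

text = "तुम मुझे दोस्त कहते कहते हो"
word = "कहते"
pattern = re.compile(u"(\s|^)%s(\s|$)" % word, flags=re.UNICODE)
matches = list(pattern.finditer(text))
# only one match comes back, although कहते occurs twice
for m in matches:
    print(repr(m.group()), m.start(), m.end())
print(len(matches))  # 1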
Upvotes: 1
Views: 53
Reputation: 626794
For non-Unicode text, you may use a better regex like
my_regex = r"(?<!\w){}(?!\w)".format(re.escape(word))
Yours won't work if the word starts with a non-word char. The (?<!\w) negative lookbehind fails the match if there is a word char immediately to the left of the current location, and the (?!\w) negative lookahead fails the match if there is a word char immediately to the right of the current location.
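For instance, a small sketch of this pattern on the English sample sentence from the question (the names here are illustrative):

import re

sentence = "These are oranges and apples and and pears, but not pinapples"
word = "and"
pattern = re.compile(r"(?<!\w){}(?!\w)".format(re.escape(word)))
# every standalone "and" is matched, and nothing around it is consumed,
# so the offsets point exactly at the word itself
for m in pattern.finditer(sentence):
    print(m.group(), m.start(), m.end())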
The problem with the Unicode text regex is that the second group consumes the whitespace after a word, and thus it is not available for the subsequent match. It is convenient to use lookarounds here:
my_regex = r"(?<!\S){}(?!\S)".format(re.escape(word))
See this Python demo online.
The (?<!\S) negative lookbehind fails the match if there is a non-whitespace char immediately to the left of the current location, and the (?!\S) negative lookahead fails the match if there is a non-whitespace char immediately to the right of the current location.
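A short sketch with the Hindi sentence from the question (the offsets will be whatever finditer reports for the exact string you search):

import re

document = "तुम मुझे दोस्त कहते कहते हो"
word = "कहते"
pattern = re.compile(r"(?<!\S){}(?!\S)".format(re.escape(word)), flags=re.UNICODE)
# both occurrences of कहते are matched, each with its own start/end,
# because the lookarounds do not consume the surrounding whitespace
for m in pattern.finditer(document):
    print(m.group(), m.start(), m.end())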
Upvotes: 1