Reputation: 201
I have the following problem: I created a dictionary (german) with words and their corresponding lemma. exemple: "Lagerbestände", "Lager-bestand"; "Wohnhäuser", "Wohn-haus"; "Bahnhof", "Bahn-hof"
I now have a text and I want to check for all word their lemmata. It can happen that it appears a word which is not in the dict, such as "Restbestände". But the lemma of "bestände", we already know it. So I want to take the first part of the word which is unknown in dicti and add this to the lemmatized second part and print this out (or return it). Example: "Restbestände" --> "Rest-bestand". ("bestand" is taken from the lemma of "Lagerbestände")
I coded the following:
for limit in range(1, len(Word)):
for k, v in dicti.iteritems():
if re.search('[\w]*'+Word[limit:], k, re.IGNORECASE) != None:
if '-' in v:
tmp = v.find('-')
end = v[tmp:]
end = re.sub(ur'[-]',"", end)
Word = Word[:limit] + '-' + end `
But I got 2 problems:
However; how would you solve this?
Upvotes: 1
Views: 263
Reputation: 33827
At the end of the words, it is printed out every time "
". How can I avoid this?
In must use UNICODE
everywhere in your script. Everywhere, everywhere, everywhere.
Also, python RegEx functions accept flag re.UNICODE
that you should always set. German letters are out of ASCII set, so RegEx can be sometimes confused, for instance when matching r'\w'
Upvotes: 1