Reputation: 11
I'm building a word list from a .txt file (65,000 words) using `collections.Counter` and `re.findall()`. It works well for English, but it drops the accented characters used in other languages, such as â, á, ü and ö. I also want compound forms like "t'appele" and "signifie-t-elle" to be counted as single, distinct words. I have tried all sorts of regex combinations without success. Does anyone know how to make it keep these characters? My code is below.
```
import collections
import re

with open(text_to_load) as f:
    words_from_text = collections.Counter(
        word.lower()
        for line in f
        for word in re.findall(r'\b[^\W\d_]+\b', line, re.UNICODE))
```
Upvotes: -1
Views: 107
Reputation: 11
Thanks a lot, you really helped me with the encoding. I had a further problem with \W in the regex, which doesn't seem to match French characters, so I solved it this way instead:
```
words_from_text = {}  # word -> count (assumed to start empty)
with open(text_to_load, "r", encoding="utf-8") as f:
    for line in f:
        # Treat sentence punctuation as word separators.
        line = line.replace(".", " ")
        line = line.replace("—", " ")
        line = line.replace(",", " ")
        line = line.lower()
        # Splitting on whitespace keeps forms like "signifie-t-elle" intact.
        for word in line.split():
            if word in words_from_text:
                words_from_text[word] += 1
            else:
                words_from_text[word] = 1
```
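For anyone who would rather stay with a regex, here is a minimal sketch of one possible alternative, not tested against the original file. It assumes Python 3, where `str` patterns are Unicode-aware by default, and it joins letter runs with apostrophes or hyphens so the compound forms stay whole. The names `WORD_RE` and `count_words` are made up for this example, and the NFC normalization step is my own addition to guard against decomposed accents.

```python
import collections
import re
import unicodedata

# One or more letters (no digits, no underscore), optionally followed by
# further letter runs joined with an apostrophe or a hyphen.
WORD_RE = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")


def count_words(lines):
    """Count lowercased words across an iterable of text lines."""
    counts = collections.Counter()
    for line in lines:
        # Compose base letter + combining accent into a single code point,
        # since a bare combining mark is not matched by \w.
        line = unicodedata.normalize("NFC", line)
        counts.update(word.lower() for word in WORD_RE.findall(line))
    return counts


if __name__ == "__main__":
    sample = [
        "Comment t'appelles-tu ? Que signifie-t-elle, là-bas ?",
        "Über Öl und Käse, à la carte.",
    ]
    print(count_words(sample).most_common(5))
```

To use it on a file, pass the open file object, for example `with open(text_to_load, encoding="utf-8") as f: words_from_text = count_words(f)`. The explicit `encoding` argument still matters: without it, `open()` falls back to the platform's default encoding, which can garble accented characters before the regex ever sees them.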
Upvotes: 0