Missing French, Spanish & German characters in wordlist generated with findall()

Question

I'm creating a wordlist from a .txt file (about 65,000 words) using collections.Counter and re.findall(). It works well for English, but it drops the accented characters used in other languages, such as â, á, ü and ö. I would also like compound forms such as "t'appele" and "signifie-t-elle" to be counted as single, distinct words. I have tried all sorts of regex combinations without success. Does anyone know how to make it include these characters? My code is below.

import collections
import re

# text_to_load holds the path to the .txt file
with open(text_to_load) as f:
    words_from_text = collections.Counter(
        word.lower()
        for line in f
        for word in re.findall(r'\b[^\W\d_]+\b', line, re.UNICODE))
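
One possible approach, sketched here under a few assumptions rather than taken from an accepted answer: in Python 3, `[^\W\d_]` already matches accented letters in `str` input, so missing characters may come from the file being decoded with the wrong encoding (or from byte strings on Python 2). The sketch below assumes the file is UTF-8 and extends the pattern so that letter runs joined by an apostrophe or hyphen stay together. The names `WORD_RE`, `count_words` and `count_words_in_file` are hypothetical helpers introduced for illustration.

```python
import collections
import re

# A run of letters (any script), optionally followed by more letter runs
# joined by a straight apostrophe, a typographic apostrophe, or a hyphen.
WORD_RE = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")


def count_words(lines):
    """Count lowercased words found in an iterable of text lines."""
    return collections.Counter(
        word.lower()
        for line in lines
        for word in WORD_RE.findall(line)
    )


def count_words_in_file(path, encoding="utf-8"):
    """Count words in a text file. The UTF-8 default is an assumption;
    pass the file's real encoding (e.g. "latin-1") if it differs."""
    with open(path, encoding=encoding) as f:
        return count_words(f)


# Small in-memory sample instead of the real 65,000-word file.
sample = [
    "Que signifie-t-elle ? Je t'appelle.",
    "Über mañana, über Mädchen 123 foo_bar",
]
counts = count_words(sample)
```

With this pattern, digits and underscores still act as separators, so `123` is skipped and `foo_bar` is split into `foo` and `bar`. If the text uses decomposed accents (a base letter plus a combining mark), normalizing each line with `unicodedata.normalize("NFC", line)` first may also be needed.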
