Reputation: 178
I read a file with the code below and then want to find the words in it using the re library. The file contains Turkish characters, so I decode it as UTF-8. The re library doesn't seem to handle Turkish characters, and the code below isn't working.
text= unicodedata.normalize("NFKD",codecs.open(os.path.abspath("texts/kopru1.txt"),"rb").read().decode("utf-8"))
text=text.replace("\r\n"," ").lower()
aa= re.findall(ur"[a-zçşıöü]+", text,re.UNICODE)
Although "ayşe" is a single word, it comes back as two separate matches, "ays" and "e".
Upvotes: 2
Views: 117
Reputation: 15028
Use the escape sequence \w
which, with re.UNICODE, matches any Unicode word character (letters of any alphabet, plus digits and the underscore). Taking an example sentence from Wikipedia:
>>> text = u'Türkî-i çin (güzel güneş) terkiplerinde de gördüğümüz'
>>> re.findall(r'\w+', text, re.UNICODE)
['Türkî', 'i', 'çin', 'güzel', 'güneş', 'terkiplerinde', 'de', 'gördüğümüz']
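Note that the question's code normalizes the text with NFKD before matching, and NFKD splits "ş" into a plain "s" followed by a combining cedilla (U+0327). A combining mark is not a word character, so even \w would stop there. Here is a minimal sketch of that effect and of normalizing to NFC instead. It is written for Python 3 (an assumption; the snippets above are Python 2), where str patterns are Unicode-aware by default, and the variable names are only illustrative:

```python
import re
import unicodedata

word = "ay\u015fe"  # "ayşe", with "ş" as the single code point U+015F

# NFKD decomposes "ş" into "s" + U+0327 (combining cedilla).
decomposed = unicodedata.normalize("NFKD", word)
# NFC recomposes the pair back into the single code point U+015F.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))  # 5 4

# The combining cedilla is not matched by \w, so the word breaks in two.
print(re.findall(r"\w+", decomposed))  # ['ays', 'e']
# With composed text the whole word is matched.
print(re.findall(r"\w+", composed))    # ['ayşe']
```

So if normalization is needed at all, using "NFC" in place of "NFKD" in the question's first line should keep letters such as ş, ç, ö and ü as single characters for the regex.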
Upvotes: 5