hinzir

Reputation: 178

python unicode regular expressions

I read a file with the code below and then want to find the words in it using the re library. The file contains Turkish characters, so I decode it as UTF-8. The re library does not seem to recognize Turkish characters, and the code below does not work.

    import codecs, os, re, unicodedata

    text = unicodedata.normalize("NFKD", codecs.open(os.path.abspath("texts/kopru1.txt"), "rb").read().decode("utf-8"))
    text = text.replace("\r\n", " ").lower()
    aa = re.findall(ur"[a-zçşıöü]+", text, re.UNICODE)

Although "ayşe" is a single word, it comes out split into "ays" and "e".

Upvotes: 2

Views: 117

Answers (1)

kqr

Reputation: 15028

Use the escape sequence \w, which with the re.UNICODE flag matches any Unicode word character (letters, digits, and the underscore) rather than only ASCII ones. Taking an example sentence from Wikipedia:

    >>> text = u'Türkî-i çin (güzel güneş) terkiplerinde de gördüğümüz'
    >>> re.findall(r'\w+', text, re.UNICODE)
    ['Türkî', 'i', 'çin', 'güzel', 'güneş', 'terkiplerinde', 'de', 'gördüğümüz']
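One thing that may also be worth checking is the NFKD normalization in the question. NFKD decomposes "ş" into a plain "s" followed by a combining cedilla, and \w does not match combining marks, so a word like "ayşe" would still be cut in two. The sketch below compares NFC and NFKD on a made-up sample string; `find_words` and `sample` are hypothetical names used only for illustration, and the code is written so that it should also run on Python 3.

```python
# -*- coding: utf-8 -*-
import re
import unicodedata


def find_words(text, form="NFC"):
    """Normalize text to the given Unicode form, lowercase it, and return its words.

    find_words is a hypothetical helper, not part of the original code.
    """
    normalized = unicodedata.normalize(form, text)
    # \w with re.UNICODE matches letters, digits, and "_", but not combining marks.
    return re.findall(r"\w+", normalized.lower(), re.UNICODE)


sample = u"Ayşe güneş"  # made-up sample text

# NFC keeps "ş" and "ü" as single code points, so the words stay whole.
composed = find_words(sample, "NFC")     # ['ayşe', 'güneş']

# NFKD splits them into base letter + combining mark, which breaks the words.
decomposed = find_words(sample, "NFKD")  # ['ays', 'e', 'gu', 'nes']
```

If normalization is needed at all, NFC appears to be the safer choice here, since it keeps each accented letter as one code point that \w can match.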

Upvotes: 5
