JayGatsby

Reputation: 1621

Finding ngrams with NLTK in Turkish text

I'm trying to find ngrams in a Turkish text which has unicode characters. Here is my code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import nltk
from nltk import word_tokenize
from nltk.util import ngrams

def find_bigrams():
    t = "çağlar boyunca geldik çağlar aktı gitti. çağlar aktı"
    token = nltk.word_tokenize(t)
    bigrams = ngrams(token,2)
    for i in bigrams:
        print i

find_bigrams()

output:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 7: ordinal not in range(128)

When I change the text like this:

t = "çağlar boyunca geldik çağlar aktı gitti"

output also changes:

('\xc3\xa7a\xc4\x9flar', 'boyunca')
('boyunca', 'geldik')
('geldik', '\xc3\xa7a\xc4\x9flar')
('\xc3\xa7a\xc4\x9flar', 'akt\xc4\xb1')
('akt\xc4\xb1', 'gitti')

How can I solve this unicode problem? And the other question: how can I convert these tuples into plain strings (without the parentheses and quote characters)?

Upvotes: 1

Views: 2787

Answers (1)

erip

Reputation: 16945

This isn't so much an NLTK problem as a unicode problem.

This can be solved by adding the right import from __future__; in this case, you need unicode_literals.

Note this example from my Mac's install of Python 2.7.10:

>>> from __future__ import unicode_literals
>>> t = "çağlar boyunca geldik çağlar aktı gitti. çağlar aktı"
>>> print(t)
çağlar boyunca geldik çağlar aktı gitti. çağlar aktı
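
Alternatively, if you'd rather not change how every literal in the module behaves, you could decode just that one byte string yourself before tokenizing. This is only a sketch for Python 2 without the __future__ import, and it assumes the source file is saved as UTF-8 (matching the coding declaration):

# decode the UTF-8 byte string into a unicode object before tokenizing
t = "çağlar boyunca geldik çağlar aktı gitti. çağlar aktı".decode("utf-8")
token = nltk.word_tokenize(t)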

bigrams is a sequence of tuples, so to remove the parens you can unpack each pair as you iterate.

>>> tup = ("hello", "world")
>>> print tup
(u'hello', u'world')
>>> l = [tup]
>>> for i in l:
...   print(i)
... 
(u'hello', u'world')
>>> for i,j in l:
...   print("{0} {1}".format(i, j))
... 
hello world
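
Equivalently, since each pair is just a tuple of strings, you can join it with a space instead of formatting the elements one by one (this assumes every token is already a unicode string, which it is once unicode_literals is in effect):

>>> print(" ".join(tup))
hello world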

Combining these ideas in your script:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import nltk
from nltk import word_tokenize
from nltk.util import ngrams

def find_bigrams():
    t = "çağlar boyunca geldik çağlar aktı gitti. çağlar aktı"
    token = nltk.word_tokenize(t)
    bigrams = ngrams(token,2)
    for i, j in bigrams:
        print("{0} {1}".format(i, j))

find_bigrams()
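
For completeness: under Python 3 this particular problem goes away, because str is unicode by default and neither the coding declaration nor the __future__ import is needed. A rough Python 3 equivalent (a sketch; the NLTK calls are unchanged, and word_tokenize still needs the Punkt models via nltk.download('punkt')) would be:

#!/usr/bin/env python3
import nltk
from nltk.util import ngrams

def find_bigrams():
    t = "çağlar boyunca geldik çağlar aktı gitti. çağlar aktı"
    tokens = nltk.word_tokenize(t)  # splits on whitespace and punctuation
    # unpack each bigram tuple so only the two words are printed
    for i, j in ngrams(tokens, 2):
        print("{0} {1}".format(i, j))

find_bigrams()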

Upvotes: 3
