user2737152

Reputation: 35

Tokenizing in french using nltk

I am trying to tokenize French words, but when I do, words that contain accented characters (e.g. the circumflex in "êtes") are printed as \x escape sequences. The following is the code I implemented.

import nltk
from nltk.tokenize import WhitespaceTokenizer
from nltk.tokenize import SpaceTokenizer
from nltk.tokenize import RegexpTokenizer
data = "Vous êtes au volant d'une voiture et vous roulez à vitesse"
#wst = WhitespaceTokenizer()
#tokenizer = RegexpTokenizer('\s+', gaps=True)
token=WhitespaceTokenizer().tokenize(data)
print token

Output i got

['Vous', '\xeates', 'au', 'volant', "d'une", 'voiture', 'et', 'vous', 'roulez', '\xe0', 'vitesse']

Desired output

['Vous', 'êtes', 'au', 'volant', "d'une", 'voiture', 'et', 'vous', 'roulez', 'à', 'vitesse']

Upvotes: 3

Views: 5264

Answers (3)

Quentin Pradet

Reputation: 4771

In Python 2, to write UTF-8 text in your code, you need to start your file with a # -*- coding: <encoding name> -*- declaration when not using plain ASCII. You also need to prefix Unicode string literals with u:

# -*- coding: utf-8 -*-

import nltk
...

data = u"Vous êtes au volant d'une voiture et vous roulez à grande vitesse"
print WhitespaceTokenizer().tokenize(data)

When you're not writing data in your Python code but reading it from a file, you must make sure that it's properly decoded by Python. The codecs module helps here:

import codecs

with codecs.open('fichier.txt', encoding='utf-8') as f:
    data = f.read()

This is good practice because if there is an encoding error, you will know about it right away: it won't bite you later on, e.g. after processing your data. This is also the only approach that works in Python 3, where codecs.open becomes the built-in open and decoding always happens right away. More generally, avoid the Python 2 str type like the plague and always stick with Unicode strings to make sure encoding is handled properly.
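In Python 3 that early-decoding behaviour comes for free. A minimal sketch (the file name and contents are illustrative, written to a temporary directory so the snippet is self-contained):

```python
# -*- coding: utf-8 -*-
# Python 3: open() decodes eagerly, so encoding errors surface immediately,
# and the strings you get back are Unicode by default.
import os
import tempfile

text = "Vous êtes au volant d'une voiture et vous roulez à vitesse"

# Write a sample UTF-8 file (stand-in for your own 'fichier.txt').
path = os.path.join(tempfile.mkdtemp(), 'fichier.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write(text)

# open() replaces codecs.open(); pass encoding explicitly rather than
# relying on the platform default.
with open(path, encoding='utf-8') as f:
    tokens = f.read().split()

print(tokens)  # 'êtes' and 'à' come through intact, not as \x escapes
```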


Bon courage !

Upvotes: 4

alvas

Reputation: 122270

You don't really need the whitespace tokenizer for French if it's a simple sentence where tokens are naturally delimited by spaces. If not, nltk.tokenize.word_tokenize() would serve you better.

See How to print UTF-8 encoded text to the console in Python < 3?

# -*- coding: utf-8 -*-

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

sentence = "Vous êtes au volant d'une voiture et vous roulez à grande $3.88 vitesse"
print sentence.split()

from nltk.tokenize import word_tokenize
print word_tokenize(sentence)

from nltk.tokenize import wordpunct_tokenize
print wordpunct_tokenize(sentence)
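If NLTK isn't at hand, a wordpunct-style split can be roughly approximated with the standard re module. This is a sketch of the same idea (the pattern mirrors the \w+|[^\w\s]+ behaviour of wordpunct_tokenize, not NLTK's exact implementation), mainly to show that Unicode-aware \w keeps accented letters like ê and à inside word tokens:

```python
# -*- coding: utf-8 -*-
# Python 3 sketch: approximate wordpunct-style tokenization with re.
# On str patterns, \w is Unicode-aware by default, so it matches ê and à.
import re

sentence = "Vous êtes au volant d'une voiture et vous roulez à grande $3.88 vitesse"

# Runs of word characters, or runs of punctuation/symbols, as tokens.
tokens = re.findall(r"\w+|[^\w\s]+", sentence)
print(tokens)
```

Note that, like wordpunct_tokenize, this splits "d'une" into three tokens and "$3.88" into several, which may or may not be what you want for French.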

Upvotes: 0

arturomp

Reputation: 29630

Take a look at the section "3.3 Text Processing with Unicode" in Chapter 3 of the NLTK book.

Make sure that your string is prefixed with u and you should be OK. Also note from that chapter that, as @tripleee suggested:

There are many factors determining what glyphs are rendered on your screen. If you are sure that you have the correct encoding, but your Python code is still failing to produce the glyphs you expected, you should also check that you have the necessary fonts installed on your system.

Upvotes: 0
