OAK

Reputation: 3166

Python Character Encoding Discrepancy

I am ingesting messages into a pandas DataFrame and attempting to run some machine learning functions on the data. When I run a tokenisation function I get a KeyError whose message is basically the content of one of the messages. Looking at that string, UTF-8 byte escapes appear in it, such as \xe2\x80\xa8 (line separator) and \xe2\x82\xac (euro currency sign).

1. Is this the cause of the error?
2. Why aren't these symbols kept as they appear in the original messages or in the DataFrame?

# coding=utf-8
from __future__ import print_function
import sys
reload(sys)
sys.setdefaultencoding("utf8")

import os
import re  # needed by tokenize_only below
import pandas as pd

path = '//directory1//'

data = []
for f in [f for f in os.listdir(path) if not f.startswith('.')]:
    with open(path+f, "r") as myfile:
        data.append(myfile.read().replace('\n', ' '))

df = pd.DataFrame(data, columns=["message"])

df["label"] = "1"

path = '//directory2//'
data = []
for f in [f for f in os.listdir(path) if not f.startswith('.')]:
    with open(path+f, "r") as myfile:
        data.append(myfile.read().replace('\n', ' '))

df2 = pd.DataFrame(data, columns=["message"])
df2["label"] = "0"

messages = pd.concat([df,df2], ignore_index=True)

import nltk
from sklearn import feature_extraction
from sklearn.feature_extraction.text import TfidfVectorizer

stopwords = nltk.corpus.stopwords.words('english')

def tokenize_only(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens

tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                 min_df=0.2, stop_words='english',
                                 use_idf=True, tokenizer=tokenize_only, ngram_range=(1,2)) # analyzer = word

tfidf_matrix = tfidf_vectorizer.fit_transform(messages.message)  # fit the vectorizer to the corpus

terms = tfidf_vectorizer.get_feature_names()

totalvocab_tokenized = []

for i in messages.message:
    # x = messages.message[i].decode('utf-8')
    x = unicode(messages.message[i], errors="replace")
    allwords_tokenized = tokenize_only(x)
    totalvocab_tokenized.extend(allwords_tokenized)

vocab_frame = pd.DataFrame({'words': totalvocab_tokenized})
print(vocab_frame)

I tried decoding each message to UTF-8, converting it to unicode, and leaving those two lines out of the last for loop altogether, but I keep getting the error.
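Concretely, the conversion variants I tried in that last loop looked like this (just a sketch of the attempts):

    x = messages.message[i].decode('utf-8')                   # decode to UTF-8/unicode
    x = unicode(messages.message[i], errors="replace")        # unicode() with replacement characters
    allwords_tokenized = tokenize_only(messages.message[i])   # skipping the conversion entirely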

Any ideas?

Thanks!

Upvotes: 0

Views: 76

Answers (1)

Alastair McCormack

Reputation: 27704

  1. It looks like you're printing the repr() of the data. If the UTF-8 bytes can't be printed, Python may choose to escape them. Print the actual string or Unicode value instead (see the sketch after this list).

  2. Get rid of sys.setdefaultencoding("utf8") and the reload(sys) call - they mask issues. If you get new exceptions, let's investigate those.

  3. Open your text files with automatic decoding. Assuming your input is UTF-8:

    with io.open(path+f, "r", encoding="utf-8") as myfile:
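
Putting points 1 and 3 together, here's a minimal sketch (assuming your files really are UTF-8 - adjust the encoding if they're not):

    import io
    import os

    path = '//directory1//'
    data = []
    for f in [f for f in os.listdir(path) if not f.startswith('.')]:
        # io.open decodes the bytes as it reads, so each message is already unicode
        with io.open(path + f, "r", encoding="utf-8") as myfile:
            data.append(myfile.read().replace('\n', ' '))

    # print the value itself, not its repr(), so characters like the euro sign are rendered
    print(data[0])

With the data decoded up front, you shouldn't need the unicode(..., errors="replace") conversion later on.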
    

Upvotes: 1
