OAK

Reputation: 3166

Python Character Encoding Discrepancy

I am ingesting messages into a pandas DataFrame and attempting to run some machine learning functions on the data. When I run a tokenisation function I get a KeyError whose message is basically the content of one of the messages. Looking at that string, UTF-8 byte escapes appear in it, such as \xe2\x80\xa8 (line separator) and \xe2\x82\xac (euro currency sign).

1. Is this the cause of the error?
2. Why aren't these symbols kept as they appear in the original messages or in the DataFrame?

# coding=utf-8
from __future__ import print_function
import sys
reload(sys)
sys.setdefaultencoding("utf8")

import os
import re  # needed by tokenize_only below
import pandas as pd

path = '//directory1//'

data = []
for f in [f for f in os.listdir(path) if not f.startswith('.')]:
    with open(path+f, "r") as myfile:
        data.append(myfile.read().replace('\n', ' '))

df = pd.DataFrame(data, columns=["message"])

df["label"] = "1"

path = '//directory2//'
data = []
for f in [f for f in os.listdir(path) if not f.startswith('.')]:
    with open(path+f, "r") as myfile:
        data.append(myfile.read().replace('\n', ' '))

df2 = pd.DataFrame(data, columns=["message"])
df2["label"] = "0"

messages = pd.concat([df,df2], ignore_index=True)

import nltk
from sklearn import feature_extraction
from sklearn.feature_extraction.text import TfidfVectorizer

stopwords = nltk.corpus.stopwords.words('english')

def tokenize_only(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens

tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                 min_df=0.2, stop_words='english',
                                 use_idf=True, tokenizer=tokenize_only, ngram_range=(1,2)) # analyzer = word

tfidf_matrix = tfidf_vectorizer.fit_transform(messages.message)  # fit the vectorizer to the corpus

terms = tfidf_vectorizer.get_feature_names()

totalvocab_tokenized = []

for i in messages.message:
    # x = messages.message[i].decode('utf-8')
    x = unicode(messages.message[i], errors="replace")
    allwords_tokenized = tokenize_only(x)
    totalvocab_tokenized.extend(allwords_tokenized)

vocab_frame = pd.DataFrame({'words': totalvocab_tokenized})
print(vocab_frame)

I tried decoding each message to UTF-8, converting it to unicode, and leaving those two lines out of the last for loop altogether, but I keep getting the error.
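Concretely, the conversion variants I tried in that last loop looked like this (just a sketch of the attempts):

    x = messages.message[i].decode('utf-8')                   # decode to UTF-8/unicode
    x = unicode(messages.message[i], errors="replace")        # unicode() with replacement characters
    allwords_tokenized = tokenize_only(messages.message[i])   # skipping the conversion entirely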

Any ideas?

Thanks!

Upvotes: 0

Views: 76

Answers (1)

Alastair McCormack

Reputation: 27704

  1. It looks like you're printing the repr() of the data. If the UTF-8 bytes can't be printed, Python may choose to escape them. Print the actual string or Unicode value instead (see the sketch after this list).

  2. Get rid of sys.setdefaultencoding("utf8") and the reload(sys) call - they mask issues. If you get new exceptions, let's investigate those.

  3. Open your text files with automatic decoding. Assuming your input is UTF-8:

    with io.open(path+f, "r", encoding="utf-8") as myfile:
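
Putting points 1 and 3 together, here's a minimal sketch (assuming your files really are UTF-8 - adjust the encoding if they're not):

    import io
    import os

    path = '//directory1//'
    data = []
    for f in [f for f in os.listdir(path) if not f.startswith('.')]:
        # io.open decodes the bytes as it reads, so each message is already unicode
        with io.open(path + f, "r", encoding="utf-8") as myfile:
            data.append(myfile.read().replace('\n', ' '))

    # print the value itself, not its repr(), so characters like the euro sign are rendered
    print(data[0])

With the data decoded up front, you shouldn't need the unicode(..., errors="replace") conversion later on.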
    

Upvotes: 1
