Reputation: 3166
I am ingesting messages into a pandas DataFrame and attempting to run some machine learning functions on the data. When I run a tokenisation function I get a KeyError: "..." that basically spits out the content of one of the messages. Looking at that string, escaped UTF-8 byte sequences appear in it, such as \xe2\x80\xa8 (line separator) and \xe2\x82\xac (euro currency sign).
1. Is this the cause of the error?
2. Why aren't these symbols kept as they appear in the original messages or in the DataFrame?
# coding=utf-8
from __future__ import print_function
import sys
reload(sys)
sys.setdefaultencoding("utf8")
import os
import pandas as pd
path = '//directory1//'
data = []
for f in [f for f in os.listdir(path) if not f.startswith('.')]:
    with open(path+f, "r") as myfile:
        data.append(myfile.read().replace('\n', ' '))
df = pd.DataFrame(data, columns=["message"])
df["label"] = "1"
path = '//directory2//'
data = []
for f in [f for f in os.listdir(path) if not f.startswith('.')]:
    with open(path+f, "r") as myfile:
        data.append(myfile.read().replace('\n', ' '))
df2 = pd.DataFrame(data, columns=["message"])
df2["label"] = "0"
messages = pd.concat([df,df2], ignore_index=True)
import re
import nltk
from sklearn import feature_extraction
from sklearn.feature_extraction.text import TfidfVectorizer
stopwords = nltk.corpus.stopwords.words('english')
def tokenize_only(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                   min_df=0.2, stop_words='english',
                                   use_idf=True, tokenizer=tokenize_only, ngram_range=(1,2))  # analyzer = word
tfidf_matrix = tfidf_vectorizer.fit_transform(messages.message) #fit the vectorizer to corpora
terms = tfidf_vectorizer.get_feature_names()
totalvocab_tokenized = []
for i in messages.message:
    # x = messages.message[i].decode('utf-8')
    x = unicode(messages.message[i], errors="replace")
    allwords_tokenized = tokenize_only(x)
    totalvocab_tokenized.extend(allwords_tokenized)
vocab_frame = pd.DataFrame({'words': totalvocab_tokenized})
print(vocab_frame)
I tried decoding each message to UTF-8 and to Unicode, and also running the last for loop without those two lines, but I keep getting an error.
Any ideas?
Thanks!
Upvotes: 0
Views: 76
Reputation: 27704
It looks like you're printing the repr() of the data. If UTF-8 can't be printed, Python may choose to escape it. Print the actual str or unicode object instead.
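For instance, a minimal Python 2 sketch (the example string is made up, and printing the decoded value assumes a UTF-8 terminal):
msg = 'Price: \xe2\x82\xac100'      # UTF-8 encoded byte string (str)
print(repr(msg))                    # shows the escapes: 'Price: \xe2\x82\xac100'
print(msg.decode('utf-8'))          # shows the real characters: Price: €100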
Get rid of the sys.setdefaultencoding("utf8") call and the sys reload; they mask issues. If you get new exceptions, let's investigate those.
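For context, this is the kind of problem the default-encoding hack papers over: in Python 2, mixing byte strings and unicode triggers an implicit ASCII decode. A minimal sketch, with made-up values:
from __future__ import print_function

s = '\xe2\x82\xac'                     # UTF-8 bytes for the euro sign (plain str)
u = u'price: '                         # a unicode object
try:
    combined = u + s                   # implicit str -> unicode decode using ASCII fails
except UnicodeDecodeError:
    combined = u + s.decode('utf-8')   # decode explicitly at the boundary instead
print(combined)                        # price: € (assuming a UTF-8 terminal)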
Open your text files with automatic decoding. Assuming your input is UTF-8:
with io.open(path+f, "r", encoding="utf-8") as myfile:
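Applied to your ingest loop, that would look roughly like this (still assuming the files are UTF-8, and keeping your directory placeholder):
import io
import os
import pandas as pd

path = '//directory1//'
data = []
for f in [f for f in os.listdir(path) if not f.startswith('.')]:
    # io.open decodes while reading, so each message is already a unicode object
    with io.open(path + f, "r", encoding="utf-8") as myfile:
        data.append(myfile.read().replace('\n', ' '))
df = pd.DataFrame(data, columns=["message"])   # the column now holds unicode, not raw bytes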
Upvotes: 1