Reputation: 85
I am trying to classify emails as spam/ham using NLTK.
Below are the steps I followed:
Extract all the tokens
Fetch all the features
Extract features from the corpus of all unique words, mapping each word to True/False
from nltk.classify.util import apply_features
from nltk import NaiveBayesClassifier
import pandas as pd
import collections
from sklearn.model_selection import train_test_split
from collections import Counter
data = pd.read_csv('https://raw.githubusercontent.com/venkat1017/Data/master/emails.csv')
"""fetch array of tuples where each tuple is defined by (tokenized_text, label)
"""
processed_tokens=data['text'].apply(lambda x:([x for x in x.split() if x.isalpha()]))
processed_tokens=processed_tokens.apply(lambda x:([x for x in x if len(x)>3]))
processed_tokens = [(i,j) for i,j in zip(processed_tokens,data['spam'])]
"""
dictword return a Set of unique words in complete corpus.
"""
list = zip(*processed_tokens)
dictionary = Counter(word for i, j in processed_tokens for word in i)
dictword = [word for word, count in dictionary.items() if count == 1]
"""maps each input text into feature vector"""
y_dict = ( [ (word, True) for word in dictword] )
feature_vec=dict(y_dict)
"""Training"""
training_set, testing_set = train_test_split(y_dict, train_size=0.7)
classifier = NaiveBayesClassifier.train(training_set)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\classify\naivebayes.py in train(cls, labeled_featuresets, estimator)
197 for featureset, label in labeled_featuresets:
198 label_freqdist[label] += 1
--> 199 for fname, fval in featureset.items():
200 # Increment freq(fval|label, fname)
201 feature_freqdist[label, fname][fval] += 1
AttributeError: 'str' object has no attribute 'items'
I get this error when trying to train the classifier on the corpus of unique words.
Upvotes: 0
Views: 328
Reputation: 515
Firstly, I hope you're aware that y_dict is just a list of (word, True) pairs, one for each word that occurs exactly once in the corpus. You're passing it as the training set to the classifier, whereas you ought to be passing a list of tuples, each made of (the feature dict of one text row, the corresponding label). While your classifier should be receiving [({'feat1': 'value1', ... }, label_value), ...] as input, you're passing [('word1', True), ...]. The classifier unpacks each pair as (featureset, label) and calls featureset.items(), so in your case featureset is the string 'word1'. The str type has no items attribute, only dict does. Hence the error.
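To make the failure concrete, here is a minimal sketch that walks both input shapes the same way the loop in the traceback does. The function name iterate_features and the sample values are made up for illustration; this is a simplified imitation of the loop, not NLTK's actual code.

```python
# Shape passed in the question: (word, True) pairs.
wrong_input = [("word1", True), ("word2", True)]
# Shape the classifier expects: (feature dict, label) pairs.
right_input = [({"word1": True, "word2": False}, 1)]


def iterate_features(labeled_featuresets):
    """Imitate the loop from the traceback: unpack each pair, then call .items()."""
    seen = []
    for featureset, label in labeled_featuresets:
        for fname, fval in featureset.items():
            seen.append((label, fname, fval))
    return seen


# Works: every featureset is a dict.
print(iterate_features(right_input))

# Fails: every "featureset" is a plain string such as 'word1'.
try:
    iterate_features(wrong_input)
except AttributeError as exc:
    print("AttributeError:", exc)
```

With the second input, the label slot receives True and the featureset slot receives the word itself, which is why the error message mentions 'str'.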
Secondly, your data modelling is wrong. Your training set should consist of a feature dict derived from each row of data['text'], paired with the matching data['spam'] value (since that is your label). Please look at how to perform document classification with nltk's classifiers in section 1.3 here.
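As a rough sketch of that data modelling, the snippet below builds one feature dict per text and pairs it with its label before training. The four rows are invented stand-ins for data['text'] and data['spam'], and the helper names tokenize and document_features are my own; the "contains(word)" feature naming is just one possible choice.

```python
from nltk import NaiveBayesClassifier

# Hypothetical stand-in for the (data['text'], data['spam']) rows in the question.
rows = [
    ("free money offer click here", 1),
    ("claim your free prize money", 1),
    ("meeting agenda for monday attached", 0),
    ("please review the attached project report", 0),
]


def tokenize(text):
    # Same filtering as in the question: alphabetic tokens longer than 3 characters.
    return [w for w in text.split() if w.isalpha() and len(w) > 3]


# Vocabulary of every word seen in the corpus.
vocabulary = sorted({w for text, _ in rows for w in tokenize(text)})


def document_features(text):
    # One dict per document: feature name -> True/False for each vocabulary word.
    tokens = set(tokenize(text))
    return {f"contains({w})": (w in tokens) for w in vocabulary}


# List of (feature dict, label) pairs, the shape NaiveBayesClassifier.train expects.
labeled_featuresets = [(document_features(text), label) for text, label in rows]

classifier = NaiveBayesClassifier.train(labeled_featuresets)
prediction = classifier.classify(document_features("free money"))
print(prediction)
```

In your script, a list shaped like labeled_featuresets is what you would hand to train_test_split instead of y_dict, and the resulting training part is what goes to NaiveBayesClassifier.train.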
Upvotes: 1