Diarmaid Finnerty

Reputation: 15

Naive Bayes for Text Classification - Python 2.7 Data Structure Issue

I am having an issue training my Naive Bayes classifier. I have a feature set and targets that I want to use, but I keep getting errors. I've looked at other questions from people with similar problems, but I can't seem to figure out the issue. I'm sure there's a simple solution, but I have yet to find it.

Here's an example of the structure of the data that I'm trying to use to train the classifier.

In [1] >> train[0]
Out[1] ({
         u'profici': [False],
         u'saver': [False],
         u'four': [True],
         u'protest': [False],
         u'asian': [True],
         u'upsid': [False],
         .
         .
         .
         u'captain': [False],
         u'payoff': [False],
         u'whose': [False]
         },
         0)

Here train[0] is the first tuple in the list: a dict mapping each stemmed word to a one-element list containing a boolean, paired with the label 0.

Obviously, the rest of the train list has the features and labels for the other documents that I want to classify.
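
Just to be explicit about the structure, each entry unpacks into a feature dict and an integer label (values taken from the output above):

    features, label = train[0]
    print(features[u'four'])    # [True]
    print(label)                # 0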

When running the following code

from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB

MNB_clf = SklearnClassifier(MultinomialNB())
MNB_clf.train(train)

I get the error message:

  TypeError: float() argument must be a string or a number 
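
For what it's worth, the message looks like what Python 2.7 prints when float() is handed a list instead of a single value, so I suspect a list is being converted somewhere:

    >>> float([False])
    Traceback (most recent call last):
      ...
    TypeError: float() argument must be a string or a number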

Edit:

The features are created as follows, from a dataframe post_sent that contains the posts in column 1 and the sentiment classification in column 2.

    from nltk.corpus import stopwords
    from nltk.tokenize import RegexpTokenizer
    from nltk.stem import PorterStemmer
    from nltk.probability import FreqDist
    import itertools

    stopwords = set(stopwords.words('english'))
    filtered_posts = []
    punc_tokenizer = RegexpTokenizer(r'\w+')

    # tokenizing and removing stopwords
    for post in post_sent.post:
        tokenized = [word.lower() for word in punc_tokenizer.tokenize(post)]
        filtered = [w for w in tokenized if w not in stopwords]
        filtered_posts.append(filtered)

    # stemming
    tokened_stemmed = []
    for post in filtered_posts:
        stemmed = []
        for w in post:
            stemmed.append(PorterStemmer().stem_word(w))
        tokened_stemmed.append(stemmed)

    # frequency dist
    all_words = list(itertools.chain.from_iterable(tokened_stemmed))
    frequency = FreqDist(all_words)

    # feature selection
    word_features = list(frequency.keys())[:3000]

    # IMPORTANT PART
    #######################
    # ------ featuresets creation ---------
    def find_features(post):
        features = {}
        wrds = set(post)
        for w in word_features:
            features[w] = [w in wrds]
        return features

    # zipping inputs with targets
    words_and_sent = zip(tokened_stemmed, post_sent.sentiment)

    # IMPORTANT PART
    ##########################
    # feature sets created here
    featuresets = [(find_features(words), sentiment)
                   for words, sentiment in words_and_sent]

Upvotes: 0

Views: 488

Answers (2)

Diarmaid Finnerty

Reputation: 15

Thanks to help from both Vivek & lenz, who explained the problem to me, I was able to reorganise my training set, and it now works. Thanks guys!

The problem was very well explained in Vivek's post. This is the code that reorganised the train data into the correct format.

    features_targ = []

    for feature in range(0, len(featuresets)):
        dict_test = featuresets[feature]
        values = list(itertools.chain.from_iterable(dict_test[0].values()))
        keys = dict_test[0].keys()
        target = dict_test[1]
        dict_ion = {}
        for key in range(0, len(keys)):
            dict_ion[keys[key]] = values[key]
        features_targ.append((dict_ion, target))
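
For anyone who prefers it shorter, the same flattening can be written as a comprehension (equivalent to the loop above, since every value is a one-element list):

    features_targ = [({key: value[0] for key, value in feats.items()}, target)
                     for feats, target in featuresets]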

Upvotes: 1

Vivek Kumar

Reputation: 36599

You are setting up train wrong. As @lenz said in the comments, remove the brackets around the feature dict values and use plain single values.

As given in the official documentation:

labeled_featuresets – A list of (featureset, label) where each featureset is a dict mapping strings to either numbers, booleans or strings.

But you are setting each mapping (the value of a key in the dict) to a list.

Your correct train should look like this:

[({u'profici':False,
   u'saver':False,
   u'four':True,
   u'protest':False,
   u'asian':True,
   u'upsid':False,
   .
   .
  }, 0),
     .. 
     ..
 ({u'profici':True,
   u'saver':False,
   u'four':False,
   u'protest':False,
   u'asian':True,
   u'upsid':False,
   .
   .
  }, 1)]
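
The simplest change is in the find_features function from your question: store a plain boolean instead of a one-element list, something like this (a sketch of your function with only that one line changed):

    def find_features(post):
        features = {}
        wrds = set(post)
        for w in word_features:
            features[w] = w in wrds   # plain boolean, not [w in wrds]
        return features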

You can take a look at more examples here: http://www.nltk.org/howto/classify.html

Upvotes: 0
