pawelty

Reputation: 1000

classify new document - Random Forest, Bag of Words

This is my first attempt at document classification with ML and Python.

  1. I first query my database to extract 5000 articles related to money laundering and convert them to a pandas df
  2. Then I extract 500 articles not related to money laundering and also convert them to a pandas df
  3. I concatenate both dfs and label them either 'money laundering' or 'other'
  4. I do preprocessing (removing punctuation and stopwords, lower-casing etc.; see the sketch after the code below)
  5. and then feed a model based on the bag-of-words principle, as below:

    # imports needed for this snippet
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # bag-of-words features over the 5000 most frequent words
    vectorizer = CountVectorizer(analyzer="word",
                                 tokenizer=None,
                                 preprocessor=None,
                                 stop_words=None,
                                 max_features=5000)

    text_features = vectorizer.fit_transform(full_df["processed full text"])
    text_features = text_features.toarray()
    labels = np.array(full_df['category'])

    # split, train the forest, and score on the held-out set
    X_train, X_test, y_train, y_test = train_test_split(text_features, labels, test_size=0.33)
    forest = RandomForestClassifier(n_estimators=100)
    forest = forest.fit(X_train, y_train)
    y_pred = forest.predict(X_test)
    accuracy_score(y_pred=y_pred, y_true=y_test)
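
For step 4, a minimal preprocessing sketch; the helper name and the raw-text column "full text" are assumptions for illustration, not from my actual code:

    # lower-case, strip punctuation, drop English stopwords
    # ("full text" is an illustrative column name)
    import string
    from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

    def preprocess(text):
        text = text.lower().translate(str.maketrans('', '', string.punctuation))
        return ' '.join(w for w in text.split() if w not in ENGLISH_STOP_WORDS)

    full_df["processed full text"] = full_df["full text"].apply(preprocess)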
    

It works fine so far (even though it gives me a suspiciously high accuracy of 99%). But now I would like to test it on a completely new text document. If I vectorize it and do forest.predict(test), it obviously says:

ValueError: Number of features of the model must match the input. Model n_features is 5000 and input n_features is 45

I am not sure how to overcome this so that I can classify a totally new article.
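
As a side note, the 99% may partly reflect the 5000-vs-500 class imbalance; a quick per-class sanity check on the split above (a sketch, reusing the y_test and y_pred already computed):

    # per-class precision/recall: is the majority class driving the 99%?
    from sklearn.metrics import classification_report, confusion_matrix

    print(classification_report(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))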

Upvotes: 1

Views: 2479

Answers (2)

pawelty

Reputation: 1000

My first implementation of Naive Bayes used the TextBlob library. It was extremely slow and my machine eventually ran out of memory.

The second try was based on this article http://zacstewart.com/2015/04/28/document-classification-with-scikit-learn.html and used MultinomialNB from sklearn.naive_bayes. And it worked like a charm:

    # imports needed for this snippet
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB

    # initialize vectorizer
    count_vectorizer = CountVectorizer(analyzer="word",
                                       tokenizer=None,
                                       preprocessor=None,
                                       stop_words=None,
                                       max_features=5000)
    counts = count_vectorizer.fit_transform(df['processed full text'].values)
    targets = df['category'].values

    # divide into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(counts, targets, test_size=0.33)

    # create and fit the classifier
    classifier = MultinomialNB()
    classifier.fit(X_train, y_train)

    # check accuracy
    y_pred = classifier.predict(X_test)
    accuracy_score(y_true=y_test, y_pred=y_pred)

    # check on a completely new (already preprocessed) example:
    # transform with the *fitted* vectorizer instead of re-fitting it
    new_counts = count_vectorizer.transform([processed_test_string])
    prediction = classifier.predict(new_counts)
    prediction

output:

    array(['money laundering'],
          dtype='<U16')

And the accuracy is around 91%, so more realistic than the 99.96% from before.

Exactly what I wanted. It would also be nice to see the most informative features; I will try to work that out (a sketch below). Thanks everyone.
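
A minimal sketch for the most informative features, using the fitted count_vectorizer and classifier from above (feature_log_prob_ and classes_ are standard MultinomialNB attributes; the top-10 cut-off is arbitrary):

    # top-10 highest log-probability words per class
    import numpy as np

    feature_names = np.array(count_vectorizer.get_feature_names_out())
    # (older scikit-learn versions: count_vectorizer.get_feature_names())
    for i, label in enumerate(classifier.classes_):
        top10 = np.argsort(classifier.feature_log_prob_[i])[-10:]
        print(label, feature_names[top10])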

Upvotes: 1

probaPerception

Reputation: 591

First of all, even though my proposition may work, I strongly emphasize that this solution has statistical and computational consequences that you need to understand before running this code. Let's assume you have an initial corpus of texts full_df["processed full text"] and that test is the new text you would like to classify. Then let's define full_added as the corpus of texts made of full_df plus test.

    text_features = vectorizer.fit_transform(full_added)
    text_features = text_features.toarray()

You could then use the full_df rows of text_features as your train set (X_train being those rows and y_train = np.array(full_df['category'])). And then you can run

    forest = RandomForestClassifier(n_estimators=100)
    forest = forest.fit(X_train, y_train)
    y_pred = forest.predict(test)  # here test means the vectorized row for the new text

Of course, in this solution you have already fixed your parameters, and you assume your model is robust on new data.
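
Putting the pieces together, a minimal end-to-end sketch of this approach (it assumes test is appended as the last element of full_added, so the last row of text_features is the new document):

    # refit the vectorizer on corpus + new text, train on the original
    # rows, and predict on the last row (the new document)
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import CountVectorizer

    full_added = list(full_df["processed full text"]) + [test]

    vectorizer = CountVectorizer(analyzer="word", max_features=5000)
    text_features = vectorizer.fit_transform(full_added).toarray()

    X_train, X_new = text_features[:-1], text_features[-1:]
    y_train = np.array(full_df['category'])

    forest = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
    print(forest.predict(X_new))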

Another remark: if you have a stream of new texts coming in that you would like to analyze, this solution would be dreadful, since the computational time of recomputing vectorizer.fit_transform(full_added) for every new text would increase dramatically.
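
The cheaper pattern for that streaming case, for what it's worth, is to fit the vectorizer once on the training corpus and only call transform() on each incoming text, which is essentially what the other answer here does:

    # fit once, then transform each incoming text without refitting
    # (incoming_text is illustrative; vectorizer and forest are the
    # objects fitted once on the training corpus)
    new_features = vectorizer.transform([incoming_text]).toarray()
    print(forest.predict(new_features))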

I hope it helps.

Upvotes: 2
