James Hamilton

Reputation: 457

TfidfVectorizer gives high weight to stop words

Given the following code:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import urllib.request  # the lib that handles the url stuff
from bs4 import BeautifulSoup
import unicodedata

def remove_control_characters(s):
    # Replace Unicode control characters (category "C", e.g. newlines)
    # with spaces and lower-case everything else.
    base = ""
    for ch in s:
        if unicodedata.category(ch)[0] != "C":
            base += ch.lower()
        else:
            base += " "
    return base

moby_dick_url = 'http://www.gutenberg.org/files/2701/2701-0.txt'

soul_of_japan = 'http://www.gutenberg.org/files/12096/12096-0.txt'

def extract_body(url):
    with urllib.request.urlopen(url) as s:
        # Take the text of the first tag inside <body> and normalize it.
        data = BeautifulSoup(s).body()[0].string
        stripped = remove_control_characters(data)
        return stripped

moby = extract_body(moby_dick_url)
bushido = extract_body(soul_of_japan)

corpus = [moby, bushido]

vectorizer = TfidfVectorizer(use_idf=False, smooth_idf=True)
tf_idf = vectorizer.fit_transform(corpus)
df_tfidf = pd.DataFrame(tf_idf.toarray(),
                        columns=vectorizer.get_feature_names(),
                        index=["Moby", "Bushido"])
df_tfidf[["the", "whale"]]

I would expect "whale" to get a relatively high tf-idf score in "Moby Dick" but a low score in "Bushido: The Soul of Japan", and "the" to score low in both. However, I see the opposite. The calculated results are:

|         | the      | whale    |
|---------|----------|----------|
| Moby    | 0.707171 | 0.083146 |
| Bushido | 0.650069 | 0.000000 |

This makes no sense to me. Can anyone point out the mistake in either my thinking or my code?

Upvotes: 3

Views: 943

Answers (1)

MaximeKan

Reputation: 4211

There are two reasons why you are observing this.

  • The first is the set of parameters you passed to your TfidfVectorizer. You should be using TfidfVectorizer(use_idf=True, ...), because it is the idf part of tf-idf (remember that tf-idf is the product of term frequency and inverse document frequency) that penalizes words appearing in every document. By setting TfidfVectorizer(use_idf=False, ...), you are only computing the term-frequency part, which naturally gives stop words the largest scores.

  • The second is your data. Even once you fix the parameter above, your corpus is still very small: just two documents. That means any word appearing in both books is penalized in exactly the same way. "Courage" might appear in both books, just as "the" does, so their idf values are identical, and stop words again come out on top because of their much larger term frequency. The sketch below illustrates both effects.
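
To make the idf effect concrete, here is a minimal sketch using a hypothetical two-document corpus standing in for the two books (the sentences are made up for illustration, and get_feature_names_out requires a recent scikit-learn; older versions use get_feature_names). With smooth_idf=True, scikit-learn computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, so with n = 2 documents a word appearing in both gets idf = ln(3/3) + 1 = 1.0, while a word appearing in only one gets idf = ln(3/2) + 1 ≈ 1.405:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical two-document corpus: "the" appears in both documents,
# "whale" only in the first.
corpus = [
    "the whale and the sea and the whale again",
    "the way of the warrior and the soul of honor",
]

# use_idf=True (the default) re-enables the inverse-document-frequency
# weighting that the question's code switched off.
vectorizer = TfidfVectorizer(use_idf=True, smooth_idf=True)
tfidf = vectorizer.fit_transform(corpus)

df = pd.DataFrame(tfidf.toarray(),
                  columns=vectorizer.get_feature_names_out(),
                  index=["doc1", "doc2"])
print(df[["the", "whale"]])
# "whale" is boosted relative to "the" (idf ~1.405 vs 1.0), but with
# only two documents "the" can still score higher overall through its
# raw term frequency, which is the second point above.

With a realistically sized corpus, common words appear in nearly every document and their idf shrinks toward the minimum, while rare content words keep a higher idf, which is exactly the behavior you were expecting.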

Upvotes: 2
