anon

Reputation: 866

Problems when TF-IDF vectorizing tokenized documents?

I am vectorizing a text blob with tokens that have the following style:

hi__(how are you), 908__(number code), the__(POS)

As you can see, the tokens have extra information attached in the form __(info). I am extracting keywords using TF-IDF, as follows:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(doc)
# sort features from highest to lowest inverse document frequency
indices = np.argsort(vectorizer.idf_)[::-1]
features = vectorizer.get_feature_names()

The problem is that when I run the above procedure to extract keywords, I suspect the vectorizer is removing the parentheses from my text blob. Which TfidfVectorizer parameter can I use to preserve the information inside the parentheses?

UPDATE

I also tried:

from sklearn.feature_extraction.text import TfidfVectorizer

def dummy_fun(doc):
    return doc

tfidf = TfidfVectorizer(
    analyzer='word',
    tokenizer=dummy_fun,
    preprocessor=dummy_fun,
    token_pattern=None)  

and

from sklearn.feature_extraction.text import TfidfVectorizer

def dummy_fun(doc):
    return doc

tfidf = TfidfVectorizer(
    tokenizer=dummy_fun,
    preprocessor=dummy_fun,
    token_pattern=None) 

However, this returns a sequence of characters instead of the tokens I had already produced:

['e', 's', '_', 'a', 't', 'o', 'c', 'r', 'i', 'n']
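Note: the dummy-tokenizer approach only behaves as intended when each document is already a list of tokens. If a document is passed as a single string, the vectorizer ends up iterating over its characters, which matches the output above. A minimal sketch of the pre-tokenized form (the docs list here is a hypothetical example):

from sklearn.feature_extraction.text import TfidfVectorizer

def dummy_fun(doc):
    # identity function: the document is passed through unchanged
    return doc

# hypothetical pre-tokenized corpus: each document is a list of tokens,
# not a single string
docs = [
    ["hi__(how are you)", "908__(number code)", "the__(POS)"],
]

tfidf = TfidfVectorizer(
    analyzer='word',
    tokenizer=dummy_fun,
    preprocessor=dummy_fun,
    token_pattern=None)

X = tfidf.fit_transform(docs)
# the __(...) parts survive intact,
# e.g. ['908__(number code)', 'hi__(how are you)', 'the__(POS)']
print(tfidf.get_feature_names_out())  # get_feature_names() on older scikit-learn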

Upvotes: 1

Views: 2328

Answers (1)

acattle

Reputation: 3113

The problem is that the default tokenization used by TfidfVectorizer explicitly ignores all punctuation:

token_pattern : string

Regular expression denoting what constitutes a “token”, only used if analyzer == 'word'. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).
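To see this concretely, here is a quick sketch using build_tokenizer(), which returns the callable built from token_pattern:

from sklearn.feature_extraction.text import TfidfVectorizer

tokenize = TfidfVectorizer().build_tokenizer()
print(tokenize("hi__(how are you), 908__(number code), the__(POS)"))
# roughly: ['hi__', 'how', 'are', 'you', '908__', 'number', 'code', 'the__', 'POS']
# the parentheses are dropped (punctuation acts as a separator), so the
# info text is split off into separate words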

Your problem is related to this previous question, but instead of treating punctuation as separate tokens, you want to keep each token__(info) unit from being split apart. In both cases, the solution is to write a custom token_pattern, although the exact patterns are different.

Assuming every token already has __(info) attached:

vectorizer = TfidfVectorizer(token_pattern=r'(?u)\b\w\w+__\([\w\s]*\)')
X = vectorizer.fit_transform(doc)

I simply modified the default token_pattern so that it now matches two or more alphanumeric characters, followed by __(, then zero or more alphanumeric or whitespace characters, and a closing ). If you want more information on how to write your own token_pattern, see the Python documentation for regular expressions.
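As a quick sanity check, a small sketch reusing the example tokens from the question:

import re

from sklearn.feature_extraction.text import TfidfVectorizer

pattern = r'(?u)\b\w\w+__\([\w\s]*\)'
text = "hi__(how are you), 908__(number code), the__(POS)"

# the pattern itself keeps each token__(info) unit in one piece
print(re.findall(pattern, text))
# ['hi__(how are you)', '908__(number code)', 'the__(POS)']

# and a vectorizer built with it produces those same units as features
# (lowercased by the default preprocessing, e.g. 'the__(pos)')
vectorizer = TfidfVectorizer(token_pattern=pattern)
X = vectorizer.fit_transform([text])
print(vectorizer.get_feature_names_out())  # get_feature_names() on older scikit-learn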

Upvotes: 2
