Keithx

Reputation: 3148

Vectorizing combinations of words in Python

I have a dataset of medical text. I apply a tf-idf vectorizer to it and calculate the tf-idf score for each word like this:

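import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer as tf

# Keep only terms that occur in at least 60 documents; drop English stop words
vect = tf(min_df=60, stop_words='english')

# df must be an iterable of text documents (e.g. a single text column),
# not a whole DataFrame, which would iterate over its column names
dtm = vect.fit_transform(df)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
l = vect.get_feature_names_out()

x = pd.DataFrame(dtm.toarray(), columns=l)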

My question is the following: when I apply TfidfVectorizer, it splits the text into distinct words, for example "pain", "headache", "nausea" and so on. How can I get word combinations in the output of TfidfVectorizer, for example "severe pain", "cluster headache", "nausea vomiting"? Thanks

Upvotes: 4

Views: 665

Answers (1)

MaxU - stand with Ukraine

Reputation: 210842

Use the ngram_range parameter. With (1,2) the vectorizer extracts both single words and two-word combinations:

vect = tf(min_df=60, stop_words='english', ngram_range=(1,2))

or, if you want only two-word combinations and no single words (depending on your goals):

vect = tf(min_df=60, stop_words='english', ngram_range=(2,2))
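
To see the difference between the two settings, here is a minimal sketch on a toy corpus. The three sentences are made up for illustration, and min_df is left at its default because the corpus is so small; with your real data you would keep min_df=60.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, invented for illustration only
docs = [
    "severe pain and cluster headache",
    "nausea and vomiting with severe pain",
    "cluster headache with nausea",
]

# (1, 2): single words plus two-word combinations
uni_bi = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
uni_bi.fit(docs)

# (2, 2): two-word combinations only
bi_only = TfidfVectorizer(stop_words="english", ngram_range=(2, 2))
bi_only.fit(docs)

# vocabulary_ maps each extracted term to its column index
unigrams_and_bigrams = sorted(uni_bi.vocabulary_)
bigrams = sorted(bi_only.vocabulary_)

print(unigrams_and_bigrams)
print(bigrams)
```

Note that stop words are dropped before the n-grams are built, so "nausea and vomiting" should come out as the bigram "nausea vomiting". It also means some bigrams join words that were not adjacent in the original text.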

Upvotes: 5
