Reputation: 43
I have taken a column of dataset which has description in text form for each row. I am trying to find words with tf-idf greater than some value n. but the code gives a matrix of scores how do I sort and filter the scores and see the corresponding word.
tempdataFrame = wineData.loc[wineData.variety == 'Shiraz',
'description'].reset_index()
tempdataFrame['description'] = tempdataFrame['description'].apply(lambda
x: str.lower(x))
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(analyzer='word', stop_words = 'english')
score = tfidf.fit_transform(tempDataFrame['description'])
Sample Data:
description
This tremendous 100% varietal wine hails from Oakville and was aged over
three years in oak. Juicy red-cherry fruit and a compelling hint of caramel
greet the palate, framed by elegant, fine tannins and a subtle minty tone in
the background. Balanced and rewarding from start to finish, it has years
ahead of it to develop further nuance. Enjoy 2022–2030.
Upvotes: 3
Views: 1864
Reputation: 4004
In the absence of a full data frame column of wine descriptions, the sample data you have provided is split in three sentences in order to create a data frame with one column named 'Description' and three rows. Then the column is passed to the tf-idf for analysis and a new data frame containing the features and their scores is created. The results are subsequently filtered using pandas.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
doc = ['This tremendous 100% varietal wine hails from Oakville and was aged over \
three years in oak.', 'Juicy red-cherry fruit and a compelling hint of caramel \
greet the palate, framed by elegant, fine tannins and a subtle minty tone in \
the background.', 'Balanced and rewarding from start to finish, it has years \
ahead of it to develop further nuance. Enjoy 2022–2030.']
df_1 = pd.DataFrame({'Description': doc})
tfidf = TfidfVectorizer(analyzer='word', stop_words = 'english')
score = tfidf.fit_transform(df_1['Description'])
# New data frame containing the tfidf features and their scores
df = pd.DataFrame(score.toarray(), columns=tfidf.get_feature_names())
# Filter the tokens with tfidf score greater than 0.3
tokens_above_threshold = df.max()[df.max() > 0.3].sort_values(ascending=False)
tokens_above_threshold
Out[29]:
wine 0.341426
oak 0.341426
aged 0.341426
varietal 0.341426
hails 0.341426
100 0.341426
oakville 0.341426
tremendous 0.341426
nuance 0.307461
rewarding 0.307461
start 0.307461
enjoy 0.307461
develop 0.307461
balanced 0.307461
ahead 0.307461
2030 0.307461
2022â 0.307461
finish 0.307461
Upvotes: 3