Reputation: 320
I have 100 documents (each document is a simple list of the words in that document). I want to create a TF-IDF matrix so that I can build a small word search by rank. I tried using TfidfVectorizer but got lost in the syntax. Any help would be much appreciated. Regards.
Edit: I converted the lists into strings and added them to a parent list:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(vocabulary=word_set)
matrix = vectorizer.fit_transform(doc_strings)
print(matrix)
Here word_set is the set of all distinct words and doc_strings is a list that contains each document as a string. However, when I print the matrix I get the output below:
(0, 839) 0.299458532286
(0, 710) 0.420878518454
(0, 666) 0.210439259227
(0, 646) 0.149729266143
(0, 550) 0.210439259227
(0, 549) 0.210439259227
(0, 508) 0.210439259227
(0, 492) 0.149729266143
(0, 479) 0.149729266143
(0, 425) 0.149729266143
(0, 401) 0.210439259227
(0, 332) 0.210439259227
(0, 310) 0.210439259227
(0, 253) 0.149729266143
(0, 216) 0.210439259227
(0, 176) 0.149729266143
(0, 122) 0.149729266143
(0, 119) 0.210439259227
(0, 111) 0.149729266143
(0, 46) 0.210439259227
(0, 26) 0.210439259227
(0, 11) 0.149729266143
(0, 0) 0.210439259227
(1, 843) 0.0144007295367
(1, 842) 0.0288014590734
(1, 25) 0.0144007295367
(1, 24) 0.0144007295367
(1, 23) 0.0432021886101
(1, 22) 0.0144007295367
(1, 21) 0.0288014590734
(1, 20) 0.0288014590734
(1, 19) 0.0288014590734
(1, 18) 0.0432021886101
(1, 17) 0.0288014590734
(1, 16) 0.0144007295367
(1, 15) 0.0144007295367
(1, 14) 0.0432021886101
(1, 13) 0.0288014590734
(1, 12) 0.0144007295367
(1, 11) 0.0102462376715
(1, 10) 0.0144007295367
(1, 9) 0.0288014590734
(1, 8) 0.0288014590734
(1, 7) 0.0144007295367
(1, 6) 0.0144007295367
(1, 5) 0.0144007295367
(1, 4) 0.0144007295367
(1, 3) 0.0144007295367
(1, 2) 0.0288014590734
(1, 1) 0.0144007295367
Is this correct and, if so, how can I search for the rank of a given word in a particular document?
Upvotes: 4
Views: 13620
Reputation: 37741
Your code is working fine. Here is an example with a couple of sentences, where each sentence plays the role of a document. Hopefully this will help you.
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["welcome to stackoverflow my friend",
"my friend, don't worry, you can get help from stackoverflow"]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(corpus)
print(matrix)
fit_transform() returns a TF-IDF-weighted document-term matrix. It is a SciPy sparse matrix, which is why only the nonzero entries are printed.
The print() statement outputs the following:
(0, 2) 0.379303492809
(0, 6) 0.379303492809
(0, 7) 0.379303492809
(0, 8) 0.533097824526
(0, 9) 0.533097824526
(1, 3) 0.342619853089
(1, 5) 0.342619853089
(1, 4) 0.342619853089
(1, 0) 0.342619853089
(1, 11) 0.342619853089
(1, 10) 0.342619853089
(1, 1) 0.342619853089
(1, 2) 0.243776847332
(1, 6) 0.243776847332
(1, 7) 0.243776847332
So, how can we interpret this matrix? Each row shows a tuple (x, y) and a value. The tuple gives the document number (in this case, the sentence number) and the feature number; the value is the TF-IDF weight.
To understand this better, let's print the list of features (in our case, the features are words) along with their indices.
# On scikit-learn older than 1.0, use vectorizer.get_feature_names() instead.
for i, feature in enumerate(vectorizer.get_feature_names_out()):
    print(i, feature)
It outputs:
0 can
1 don
2 friend
3 from
4 get
5 help
6 my
7 stackoverflow
8 to
9 welcome
10 worry
11 you
So, the sentence "welcome to stackoverflow my friend" is transformed to the following:
(0, 2) 0.379303492809
(0, 6) 0.379303492809
(0, 7) 0.379303492809
(0, 8) 0.533097824526
(0, 9) 0.533097824526
For example, the first two rows can be interpreted as follows.
0 = sentence no.
2 = word index (index of the word `friend`)
0.379303492809 = tf-idf weight
0 = sentence no.
6 = word index (index of the word `my`)
0.379303492809 = tf-idf weight
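The same reading can be applied to every stored entry at once. Here is a rough sketch, assuming the two-sentence corpus from this answer; the names index_to_word and entries are only illustrative. It uses the fitted vocabulary_ mapping (word to column index) and the COO view of the sparse matrix to print each entry as document number, word, and weight.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["welcome to stackoverflow my friend",
          "my friend, don't worry, you can get help from stackoverflow"]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(corpus)

# vocabulary_ maps word -> column index; invert it to look words up by index.
index_to_word = {index: word for word, index in vectorizer.vocabulary_.items()}

# The COO format exposes parallel arrays of row indices, column indices and values.
coo = matrix.tocoo()
entries = [(int(doc), index_to_word[int(col)], float(weight))
           for doc, col, weight in zip(coo.row, coo.col, coo.data)]

for doc, word, weight in entries:
    print(doc, word, round(weight, 6))
```

Each printed line corresponds to one (document, feature) row of the sparse output, with the feature index replaced by the word itself.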
From the TF-IDF values, you can see that the words "welcome" and "to" should rank higher than the other words in the first sentence (document 0), because they appear only in that sentence.
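If you want to see where these numbers come from, the weights for sentence 0 can be recomputed by hand. This is a minimal sketch that assumes TfidfVectorizer's default settings (smooth_idf=True, norm='l2', raw term counts); the variable names are my own. With those defaults the inverse document frequency is ln((1 + n) / (1 + df)) + 1, and each document vector is scaled to unit length.

```python
import math

n_docs = 2  # two sentences in the corpus

def idf(doc_freq):
    # Smoothed idf, as used when smooth_idf=True (assumed default).
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# Number of sentences each word of sentence 0 appears in.
doc_freq = {"welcome": 1, "to": 1, "stackoverflow": 2, "my": 2, "friend": 2}

# Every word occurs once in sentence 0, so the term frequency is 1.
raw = {word: 1 * idf(df) for word, df in doc_freq.items()}

# L2 normalisation: divide by the Euclidean length of the vector.
norm = math.sqrt(sum(value * value for value in raw.values()))
weights = {word: value / norm for word, value in raw.items()}

for word, weight in weights.items():
    print(word, round(weight, 6))
```

The words that occur in only one sentence get the larger idf, which is why "welcome" and "to" end up with 0.533 while the shared words get 0.379.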
You can extend this example to look up the rank of a given word in a particular sentence or document.
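One possible way to do that, as a sketch rather than a definitive implementation: the helpers ranked_words and word_rank below are hypothetical names, not part of scikit-learn. They sort the nonzero weights of one document in descending order and report the 1-based position of a word. Ties are broken alphabetically here, which is an arbitrary choice.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["welcome to stackoverflow my friend",
          "my friend, don't worry, you can get help from stackoverflow"]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(corpus)

# vocabulary_ maps word -> column index; invert it to look words up by index.
index_to_word = {index: word for word, index in vectorizer.vocabulary_.items()}

def ranked_words(doc_index):
    """Return (word, weight) pairs of one document, highest weight first."""
    # A dense row is fine for a small vocabulary such as 100 documents.
    weights = matrix[doc_index].toarray().ravel()
    pairs = [(index_to_word[j], float(w)) for j, w in enumerate(weights) if w > 0]
    # Sort by descending weight, then alphabetically to break ties.
    return sorted(pairs, key=lambda pair: (-pair[1], pair[0]))

def word_rank(word, doc_index):
    """Return the 1-based rank of word in the document, or None if absent."""
    for rank, (candidate, _) in enumerate(ranked_words(doc_index), start=1):
        if candidate == word:
            return rank
    return None

print(ranked_words(0))
print(word_rank("welcome", 0))
print(word_rank("worry", 0))
```

A word that does not occur in the document has weight 0 and is reported as None rather than given a rank.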
Upvotes: 19