Reputation: 1
I'm working on keyword extraction from an article using TF-IDF. The pipeline I follow is as follows:
However, the problem I'm facing is that the score I get for each token is relative to its sentence, whereas what I want is the token's score relative to the whole article. How do I go about achieving that?
For example, this is my toy text:
"Rashid Siddiqui kept hearing those words from his fellow Muslim pilgrims lying mangled on the ground in 118-degree heat, under a searing Saudi sun. Barefoot, topless and dazed, Mr. Siddiqui had somehow escaped being crushed by the surging crowd.It was Sept. 24, 2015, the third morning of the hajj, the annual five-day pilgrimage to Mecca by millions of Muslims from around the world. By some estimates, it was the deadliest day in hajj history and one of the worst accidents in the world in decades. An American from Atlanta, Mr. Siddiqui, 42, had been walking through a sprawling valley of tens of thousands of pilgrim tents. His destination: Jamarat Bridge, where pilgrims throw pebbles at three large pillars in a ritual symbolizing the stoning of the devil. He was less than a mile from the bridge when the crush began."
And this is my weight matrix:
[[ 0.24922681 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0.24922681 0. 0.
0. 0. 0.24922681 0.24922681 0. 0.24922681
0.24922681 0. 0. 0.24922681 0. 0.24922681
0.24922681 0. 0. 0. 0. 0.
0.24922681 0. 0. 0. 0. 0.20107462
0. 0.24922681 0. 0.24922681 0.24922681 0.
0.1669101 0. 0. 0.24922681 0. 0. 0.
0. 0. 0. 0. 0. 0.
0.24922681 0. 0. ]
[ 0. 0.22910137 0.22910137 0. 0. 0.
0.22910137 0. 0.22910137 0. 0. 0.22910137
0. 0.22910137 0.18483754 0.22910137 0. 0. 0.
0. 0. 0.22910137 0. 0. 0.
0.18483754 0. 0. 0. 0. 0. 0.
0. 0. 0.22910137 0. 0.22910137 0.22910137
0.18483754 0. 0.22910137 0. 0. 0.22910137
0. 0. 0. 0. 0. 0.
0.22910137 0.15343186 0. 0. 0. 0.22910137
0. 0. 0. 0. 0. 0.22910137
0. 0. 0. 0.18483754 0. ]
[ 0. 0. 0. 0.22910137 0.22910137 0.22910137
0. 0.22910137 0. 0. 0. 0. 0.
0. 0.18483754 0. 0.22910137 0.22910137 0. 0.
0. 0. 0.22910137 0. 0. 0.18483754
0. 0. 0.22910137 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.18483754
0. 0. 0. 0.22910137 0. 0. 0.
0. 0. 0. 0. 0. 0.15343186
0.22910137 0. 0. 0. 0. 0.22910137
0.22910137 0.22910137 0. 0. 0.22910137 0.22910137
0. 0.18483754 0.22910137]
My question is: are these weights for each token computed with respect to the sentence or with respect to the whole article? If they are with respect to the sentence, how do I make them with respect to the whole article?
What I'm trying to achieve is an unsupervised technique that uses TF-IDF to extract keywords from a single article.
Upvotes: 0
Views: 2440
Reputation: 261
TfidfVectorizer is equivalent to applying a CountVectorizer followed by a TfidfTransformer, as given here. If I understood you correctly, you passed in an article and got back a matrix of weight vectors, but that would only happen if you split the article into sentences (or similar units) first: each string you pass is treated as one document and produces one row. If you pass just one article, it returns a single sparse row. Here is a sample Python notebook I made that should help you.
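To make the "one row per input string" point concrete, here is a minimal pure-Python sketch. It is not scikit-learn itself: the helpers `tokenize` and `tfidf_rows` are hypothetical names written for illustration, and they only mirror what I understand to be TfidfVectorizer's default weighting (raw term counts, smoothed idf = ln((1 + n) / (1 + df)) + 1, L2-normalised rows). The three short sentences are made-up stand-ins for the article, and summing the per-sentence rows is just one possible way to get an article-level score, not something the library does for you.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical helper: lowercase word tokens of two or more characters.
    return re.findall(r"\b\w\w+\b", text.lower())


def tfidf_rows(docs):
    """Return (vocab, rows) with one L2-normalised TF-IDF row per input string."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    n = len(docs)
    # Document frequency: in how many input strings each term occurs.
    df = {t: sum(1 for c in counts if t in c) for t in vocab}
    # Smoothed idf (assumed to match scikit-learn's default, smooth_idf=True).
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for c in counts:
        raw = [c[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0
        rows.append([x / norm for x in raw])
    return vocab, rows


# Made-up stand-in for an article that has been split into sentences.
sentences = [
    "Pilgrims walk to the bridge.",
    "Pilgrims throw pebbles at the pillars.",
    "The crush began near the bridge.",
]

# Three input strings -> three rows, each scored relative to its own sentence.
vocab, per_sentence = tfidf_rows(sentences)

# One input string -> one row. With a single document every idf equals 1,
# so the scores reduce to normalised term frequencies.
_, whole_article = tfidf_rows([" ".join(sentences)])

# One possible article-level score (an assumption, not a library feature):
# sum each term's weight over all sentence rows.
article_scores = {t: sum(row[i] for row in per_sentence) for i, t in enumerate(vocab)}
top_keywords = sorted(article_scores, key=article_scores.get, reverse=True)[:5]

print(len(per_sentence), len(whole_article))
print(top_keywords)
```

Note that without stop-word filtering a word such as "the" will tend to rank near the top of the aggregated scores, so in practice you would probably want to remove stop words before ranking.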
Upvotes: 1