Extracting Keywords using TF-IDF

Question

I'm tackling the problem of keyword extraction from an article using TF-IDF. My pipeline is as follows:

Input Text
Tokenize into sentences to build vocabulary
Apply CountVectorizer to build a count vector for each sentence
Apply TfidfTransformer to turn those counts into TF-IDF weights
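
The steps above can be sketched roughly as follows. This is a minimal illustration, assuming scikit-learn is installed; the regex sentence splitter and the shortened three-sentence excerpt of the toy text are assumptions made for brevity, not part of the original pipeline.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Shortened excerpt of the toy text (assumption: three sentences are enough to illustrate).
text = (
    "By some estimates, it was the deadliest day in hajj history and one of "
    "the worst accidents in the world in decades. "
    "His destination: Jamarat Bridge, where pilgrims throw pebbles at three "
    "large pillars in a ritual symbolizing the stoning of the devil. "
    "He was less than a mile from the bridge when the crush began."
)

# Naive sentence splitter (assumption): break after ., ! or ? followed by whitespace.
sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

# One row of raw term counts per sentence.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(sentences)

# TF-IDF weights; every row (sentence) is treated as its own document.
weights = TfidfTransformer().fit_transform(counts)

print(weights.shape)  # (number of sentences, vocabulary size)
print(weights.toarray())
```

Each row of `weights` corresponds to one sentence and each column to one vocabulary term, which is the layout of the matrix shown further down.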

However, the problem I'm facing is that the score I get for each token is relative to its sentence, whereas what I want is the score of each token relative to the whole article. How do I go about achieving that?

For example, this is my toy text:

"Rashid Siddiqui kept hearing those words from his fellow Muslim pilgrims lying mangled on the ground in 118-degree heat, under a searing Saudi sun. Barefoot, topless and dazed, Mr. Siddiqui had somehow escaped being crushed by the surging crowd. It was Sept. 24, 2015, the third morning of the hajj, the annual five-day pilgrimage to Mecca by millions of Muslims from around the world. By some estimates, it was the deadliest day in hajj history and one of the worst accidents in the world in decades. An American from Atlanta, Mr. Siddiqui, 42, had been walking through a sprawling valley of tens of thousands of pilgrim tents. His destination: Jamarat Bridge, where pilgrims throw pebbles at three large pillars in a ritual symbolizing the stoning of the devil. He was less than a mile from the bridge when the crush began."

And this is my weight matrix:

[[ 0.24922681  0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.24922681  0.          0.
   0.          0.          0.24922681  0.24922681  0.          0.24922681
   0.24922681  0.          0.          0.24922681  0.          0.24922681
   0.24922681  0.          0.          0.          0.          0.
   0.24922681  0.          0.          0.          0.          0.20107462
   0.          0.24922681  0.          0.24922681  0.24922681  0.
   0.1669101   0.          0.          0.24922681  0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.24922681  0.          0.        ]

 [ 0.          0.22910137  0.22910137  0.          0.          0.
   0.22910137  0.          0.22910137  0.          0.          0.22910137
   0.          0.22910137  0.18483754  0.22910137  0.          0.          0.
   0.          0.          0.22910137  0.          0.          0.
   0.18483754  0.          0.          0.          0.          0.          0.
   0.          0.          0.22910137  0.          0.22910137  0.22910137
   0.18483754  0.          0.22910137  0.          0.          0.22910137
   0.          0.          0.          0.          0.          0.
   0.22910137  0.15343186  0.          0.          0.          0.22910137
   0.          0.          0.          0.          0.          0.22910137
   0.          0.          0.          0.18483754  0.        ]

 [ 0.          0.          0.          0.22910137  0.22910137  0.22910137
   0.          0.22910137  0.          0.          0.          0.          0.
   0.          0.18483754  0.          0.22910137  0.22910137  0.          0.
   0.          0.          0.22910137  0.          0.          0.18483754
   0.          0.          0.22910137  0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.18483754
   0.          0.          0.          0.22910137  0.          0.          0.
   0.          0.          0.          0.          0.          0.15343186
   0.22910137  0.          0.          0.          0.          0.22910137
   0.22910137  0.22910137  0.          0.          0.22910137  0.22910137
   0.          0.18483754  0.22910137]]

My question is: are these weights for each token computed with respect to the sentence or with respect to the whole article? If they are with respect to the sentence, how do I make them relative to the whole article?

What I'm trying to achieve is an unsupervised technique that uses TF-IDF to extract keywords from a single article.
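
Since TfidfTransformer treats every row of the count matrix as a separate document, passing it one row per sentence produces weights relative to each sentence. One possible way to collapse them into a single score per token for the article is to sum each token's weights over all sentences and rank the totals. The sketch below assumes scikit-learn and NumPy; the `article_keywords` helper, the regex splitter, the English stop-word filter, and the choice of a column sum are illustrative assumptions, not an established answer.

```python
import re

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer


def article_keywords(text, top_k=5):
    """Rank tokens by the sum of their per-sentence TF-IDF weights (hypothetical helper)."""
    # Naive sentence splitter (assumption): break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    # Per-sentence counts, dropping common English stop words (assumption).
    vectorizer = CountVectorizer(stop_words="english")
    counts = vectorizer.fit_transform(sentences)

    # Per-sentence TF-IDF weights, as in the original pipeline.
    weights = TfidfTransformer().fit_transform(counts)

    # Collapse the sentence axis: one aggregated score per vocabulary term.
    scores = np.asarray(weights.sum(axis=0)).ravel()

    # Column index -> term, recovered from the fitted vocabulary.
    terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

    order = np.argsort(scores)[::-1][:top_k]
    return [(terms[i], float(scores[i])) for i in order]


# Shortened excerpt of the toy text (assumption: three sentences for brevity).
text = (
    "By some estimates, it was the deadliest day in hajj history and one of "
    "the worst accidents in the world in decades. "
    "His destination: Jamarat Bridge, where pilgrims throw pebbles at three "
    "large pillars in a ritual symbolizing the stoning of the devil. "
    "He was less than a mile from the bridge when the crush began."
)

keywords = article_keywords(text, top_k=5)
for term, score in keywords:
    print(f"{term}: {score:.3f}")
```

Summing is only one aggregation choice; taking the mean or the maximum per column would rank tokens differently, so the result should be read as a rough heuristic rather than a definitive keyword list.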
