Daniel Svoboda

Reputation: 131

Best way to extract keywords from input NLP sentence

I'm working on a project where I need to extract important keywords from a sentence. I've been using a rule-based system based on POS tags, but I run into ambiguous terms that it can't handle. Is there a machine learning classifier I can use to extract relevant keywords based on a training set of different sentences?
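To make it concrete, here is a rough sketch of the kind of POS-tag rule I mean, assuming spaCy and its en_core_web_sm model (just for illustration):

import spacy

nlp = spacy.load("en_core_web_sm")  # install with: python -m spacy download en_core_web_sm

def pos_keywords(sentence):
    # keep nouns and proper nouns that are not stopwords as candidate keywords
    doc = nlp(sentence)
    return [token.lemma_ for token in doc
            if token.pos_ in ("NOUN", "PROPN") and not token.is_stop]

print(pos_keywords("I need to extract important keywords from a sentence."))
# e.g. ['keyword', 'sentence']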

Upvotes: 13

Views: 17107

Answers (5)

bad programmer

Reputation: 934

If you want to extract important keywords from a corpus as a whole, this snippet may help: it ranks words by their idf values. We will extract keywords from the alt.atheism category of the 20 Newsgroups dataset. Maybe not your go-to choice :)

## THE CODE IS SELF EXPLANATORY AND COMMENTED 

## loading some dependencies
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
import nltk
nltk.download('wordnet')
from sklearn.feature_extraction.text import TfidfVectorizer

## our dataset
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train', shuffle=True, categories=["alt.atheism"])
## defining a stemmer to use
stemmer = SnowballStemmer("english")

## this dictionary will come in handy later on (it maps stemmed tokens back to an original form)
stemmed_to_original = {}

## Basic Preprocessing Functions ##
def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            stemmed_token = lemmatize_stemming(token)
            stemmed_to_original[stemmed_token] = token
            result.append(stemmed_token)
    return result


news_data = [preprocess(i) for i in newsgroups_train.data]
## note: the min_df and max_df parameters are really important for getting the most important keywords out of your corpus;
## the identity tokenizer and lowercase=False are used because news_data is already a list of preprocessed tokens
vectorizer = TfidfVectorizer(stop_words=gensim.parsing.preprocessing.STOPWORDS, min_df=20, max_df=0.72, tokenizer=lambda x: x, lowercase=False)
vectorizer.fit_transform( news_data  )

## get the idf values of all the tokens used by the vectorizer and sort them in ascending order.
## Once stopwords and (really frequent / really rare) words have been filtered out by the
## min_df / max_df parameters above, this kind of sorting surfaces the important keywords.

## make a dictionary of words and their corresponding idf weights
## (on newer scikit-learn, use vectorizer.get_feature_names_out() instead of get_feature_names())
word_to_idf = {i: j for i, j in zip(vectorizer.get_feature_names(), vectorizer.idf_)}
## sort by idf weight in ascending order (this turns the dictionary into a list of (word, idf) tuples)
word_to_idf = sorted(word_to_idf.items(), key=lambda x: x[1], reverse=False)
print(word_to_idf)

Let's print the top N results:

for k,v in word_to_idf[:5]:
    print( '{} ---> {} ----> {}'.format( k , stemmed_to_original[k] , v    )  ) 

Let's look at the top results.

If we had been more careful about removing headers and salutations from the news posts, we could have avoided words like post, article and host (see the snippet after the output below). But never mind.


post ---> posting ----> 1.4392949726265691
articl ---> article ----> 1.4754236967150747
host ---> host ----> 1.7035965964342865
nntp ---> nntp ----> 1.7248288165400607
think ---> think ----> 1.8287597393882924
peopl ---> people ----> 1.887600239411226
know ---> know ----> 1.994083719813676
univers ---> universe ----> 1.994083719813676
atheist ---> atheists ----> 2.011081296182247
like ---> like ----> 2.016811970891232
thing ---> things ----> 2.094462905121298
time ---> time ----> 2.199133527685187
mean ---> means ----> 2.2271073797275927
believ ---> believe ----> 2.2705924916673315
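
As a side note, a small sketch of that cleanup using the remove option of fetch_20newsgroups (this was not part of the run above, so the idf values would change):

## same dataset load as before, but let scikit-learn strip the boilerplate:
## remove=('headers', 'footers', 'quotes') drops message headers, signatures and quoted replies,
## so tokens like post, articl, host and nntp should mostly disappear from the top of the list
newsgroups_train = fetch_20newsgroups(subset='train', shuffle=True,
                                      categories=["alt.atheism"],
                                      remove=('headers', 'footers', 'quotes'))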

Upvotes: 2

qaiser

Reputation: 2868

We can use gensim as well to extract keywords from a given text. Note that the gensim.summarization module was removed in gensim 4.0, so this requires gensim < 4:

from gensim.summarization import keywords


text_en = (
    'Compatibility of systems of linear constraints over the set of '
    'natural numbers. Criteria of compatibility of a system of linear '
    'Diophantine equations, strict inequations, and nonstrict inequations '
    'are considered. Upper bounds for components of a minimal set of '
    'solutions and algorithms of construction of minimal generating sets '
    'of solutions for all types of systems are given. These criteria and '
    'the corresponding algorithms for constructing a minimal supporting '
    'set of solutions can be used in solving all the considered types of '
    'systems and systems of mixed types.')

print(keywords(text_en,words = 10,scores = True, lemmatize = True))

The output will be:

[('numbers', 0.31009020729627595),
('types', 0.2612797117033426),
('upper', 0.26127971170334247),
('considered', 0.2539581373644024),
('minimal', 0.25089449987505835),
('sets', 0.2508944998750583),
('inequations', 0.25051980840329924),
('linear', 0.2505198084032991),
('strict', 0.23778663563992564),
('diophantine', 0.23778663563992555)]

Upvotes: 4

Kabilesh

Reputation: 1012

Try TfidfVectorizer from sklearn

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())

This gives the keywords from the corpus. You can also get the scores of the keywords, the top n keywords, etc. (see the sketch after the output below).

Output

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

In the above output, stopwords such as "is" and "the" appear because the corpus is very small. With a larger corpus you will get the most important keywords in priority order. Please check the TfidfVectorizer documentation for more details.
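
For the "scores / top n" part, here is a minimal sketch that ranks terms by their total tf-idf weight over the corpus (the helper name top_keywords is just for illustration; it reuses X and vectorizer from above):

import numpy as np

def top_keywords(X, feature_names, n=5):
    # rank features by their summed tf-idf score over the whole corpus
    scores = np.asarray(X.sum(axis=0)).ravel()
    top = np.argsort(scores)[::-1][:n]
    return [(feature_names[i], scores[i]) for i in top]

# prints the n terms with the largest total tf-idf weight, highest first
print(top_keywords(X, vectorizer.get_feature_names(), n=5))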

Upvotes: 3

v.grabovets

Reputation: 647

Also try this multilingual RAKE implementation; it works with any language.

It can be installed with pip install multi-rake:

from multi_rake import Rake

text_en = (
    'Compatibility of systems of linear constraints over the set of '
    'natural numbers. Criteria of compatibility of a system of linear '
    'Diophantine equations, strict inequations, and nonstrict inequations '
    'are considered. Upper bounds for components of a minimal set of '
    'solutions and algorithms of construction of minimal generating sets '
    'of solutions for all types of systems are given. These criteria and '
    'the corresponding algorithms for constructing a minimal supporting '
    'set of solutions can be used in solving all the considered types of '
    'systems and systems of mixed types.'
)

rake = Rake()

keywords = rake.apply(text_en)

print(keywords[:10])

#  ('minimal generating sets', 8.666666666666666),
#  ('linear diophantine equations', 8.5),
#  ('minimal supporting set', 7.666666666666666),
#  ('minimal set', 4.666666666666666),
#  ('linear constraints', 4.5),
#  ('natural numbers', 4.0),
#  ('strict inequations', 4.0),
#  ('nonstrict inequations', 4.0),
#  ('upper bounds', 4.0),
#  ('mixed types', 3.666666666666667)

Upvotes: 5

errantlinguist

Reputation: 3828

Check out RAKE: it's quite a nice little Python library.

EDIT: I've also found a tutorial on how to get started with it.
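
For a quick start, a minimal sketch assuming the rake-nltk package (one of several RAKE implementations, installable with pip install rake-nltk):

from rake_nltk import Rake

# may require nltk.download('stopwords') and nltk.download('punkt') first
rake = Rake()  # RAKE splits text at stopwords/punctuation and scores the remaining phrases
rake.extract_keywords_from_text(
    'Criteria of compatibility of a system of linear Diophantine equations are considered.'
)
print(rake.get_ranked_phrases())  # phrases ordered from highest to lowest score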

Upvotes: 12
