Reputation: 131
I'm working on a project where I need to extract important keywords from a sentence. I've been using a rules-based system built on POS tags, but I run into ambiguous terms that it can't parse. Is there a machine learning classifier I could use to extract relevant keywords, given a training set of different sentences?
Upvotes: 13
Views: 17107
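For context, the rules-based POS baseline described in the question might look something like this minimal NLTK sketch (the noun/adjective heuristic is an illustrative assumption, not the asker's actual rules):

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

def pos_keywords(sentence):
    # keep nouns and adjectives as candidate keywords (a common heuristic)
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    return [word for word, tag in tagged if tag.startswith(('NN', 'JJ'))]

print(pos_keywords('Extract the important keywords from an ambiguous sentence.'))
# e.g. ['important', 'keywords', 'ambiguous', 'sentence']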
Reputation: 934
If you need important-keyword extraction from a corpus as a whole, this snippet could be helpful: it extracts words based on their idf values. We'll work on extracting keywords from the alt.atheism category of the 20 Newsgroups dataset. Maybe not your go-to choice, but it's a start :)
## the code is self-explanatory and commented
## loading some dependencies
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
import nltk
nltk.download('wordnet')
from sklearn.feature_extraction.text import TfidfVectorizer
## our dataset
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train', shuffle=True, categories=['alt.atheism'])
## defining a stemmer to use
stemmer = SnowballStemmer("english")
## this dictionary will come in handy later on (it maps stemmed tokens back to an original surface form)
stemmed_to_original = {}
## Basic Preprocessing Functions ##
def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            stemmed_token = lemmatize_stemming(token)
            stemmed_to_original[stemmed_token] = token
            result.append(stemmed_token)
    return result

news_data = [preprocess(i) for i in newsgroups_train.data]
## notice, min_df and max_df parameters are really important in getting the most important keywords out of your corpus
vectorizer = TfidfVectorizer(stop_words=gensim.parsing.preprocessing.STOPWORDS, min_df=20, max_df=0.72, tokenizer=lambda x: x, lowercase=False)
vectorizer.fit_transform( news_data )
## get the idf values of all the tokens used by the vectorizer and sort them in ascending order
## it depends on how you define "important", but in most text corpora, once stopwords and
## (very frequent / very rare) terms have been filtered out by the vectorizer parameters above,
## this kind of sorting surfaces the important keywords
## make a dictionary of words and their corresponding idf weights
word_to_idf = {i: j for i, j in zip(vectorizer.get_feature_names(), vectorizer.idf_)}
## sort the dictionary in ascending order of idf weight
word_to_idf = sorted(word_to_idf.items(), key=lambda x: x[1])
print(word_to_idf)
for k, v in word_to_idf[:14]:
    print('{} ---> {} ----> {}'.format(k, stemmed_to_original[k], v))
If we had been more careful about removing the headers and salutations of the news posts, we could have avoided words like post, article, and host. But never mind:
post ---> posting ----> 1.4392949726265691
articl ---> article ----> 1.4754236967150747
host ---> host ----> 1.7035965964342865
nntp ---> nntp ----> 1.7248288165400607
think ---> think ----> 1.8287597393882924
peopl ---> people ----> 1.887600239411226
know ---> know ----> 1.994083719813676
univers ---> universe ----> 1.994083719813676
atheist ---> atheists ----> 2.011081296182247
like ---> like ----> 2.016811970891232
thing ---> things ----> 2.094462905121298
time ---> time ----> 2.199133527685187
mean ---> means ----> 2.2271073797275927
believ ---> believe ----> 2.2705924916673315
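As a follow-up sketch (my addition, not part of the answer above): since the question is about individual sentences, you can also rank the terms of a single document by its row in the tf-idf matrix rather than by corpus-level idf, reusing the vectorizer and news_data defined above:

import numpy as np

## keep the tf-idf matrix this time instead of discarding it
tfidf_matrix = vectorizer.fit_transform(news_data)
feature_names = np.array(vectorizer.get_feature_names())

doc_index = 0  ## any document in the corpus
row = tfidf_matrix[doc_index].toarray().ravel()
for idx in row.argsort()[::-1][:10]:
    if row[idx] > 0:
        print(feature_names[idx], '--->', row[idx])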
Upvotes: 2
Reputation: 2868
We can use gensim as well for extracting keywords from a given text:
from gensim.summarization import keywords
text_en = (
'Compatibility of systems of linear constraints over the set of '
'natural numbers. Criteria of compatibility of a system of linear '
'Diophantine equations, strict inequations, and nonstrict inequations '
'are considered. Upper bounds for components of a minimal set of '
'solutions and algorithms of construction of minimal generating sets '
'of solutions for all types of systems are given. These criteria and '
'the corresponding algorithms for constructing a minimal supporting '
'set of solutions can be used in solving all the considered types of '
'systems and systems of mixed types.')
print(keywords(text_en, words=10, scores=True, lemmatize=True))
The output will be:
[('numbers', 0.31009020729627595),
('types', 0.2612797117033426),
('upper', 0.26127971170334247),
('considered', 0.2539581373644024),
('minimal', 0.25089449987505835),
('sets', 0.2508944998750583),
('inequations', 0.25051980840329924),
('linear', 0.2505198084032991),
('strict', 0.23778663563992564),
('diophantine', 0.23778663563992555)]
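Note: the gensim.summarization module (and with it keywords) was removed in Gensim 4.0, so this snippet requires Gensim 3.x.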
Upvotes: 4
Reputation: 1012
Try TfidfVectorizer from sklearn:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
This gives the keywords (the vocabulary) extracted from the corpus. You can also get each keyword's score and take the top n keywords, etc.; see the sketch after the output below.
Output
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
In the output above, stopwords such as "is" and "the" appear because the corpus is very small. With a large corpus you can get the most important keywords in priority order. See the TfidfVectorizer documentation for more detail.
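For instance, a minimal sketch (my addition) of ranking terms by their summed tf-idf weight across the corpus, reusing X and vectorizer from above:

import numpy as np

# sum each term's tf-idf weight over all documents and rank in descending order
scores = np.asarray(X.sum(axis=0)).ravel()
terms = np.array(vectorizer.get_feature_names())
for i in scores.argsort()[::-1][:5]:
    print(terms[i], scores[i])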
Upvotes: 3
Reputation: 647
Also try this multilingual RAKE implementation; it works with any language.
It can be installed with pip install multi-rake:
from multi_rake import Rake
text_en = (
'Compatibility of systems of linear constraints over the set of '
'natural numbers. Criteria of compatibility of a system of linear '
'Diophantine equations, strict inequations, and nonstrict inequations '
'are considered. Upper bounds for components of a minimal set of '
'solutions and algorithms of construction of minimal generating sets '
'of solutions for all types of systems are given. These criteria and '
'the corresponding algorithms for constructing a minimal supporting '
'set of solutions can be used in solving all the considered types of '
'systems and systems of mixed types.'
)
rake = Rake()
keywords = rake.apply(text_en)
print(keywords[:10])
# ('minimal generating sets', 8.666666666666666),
# ('linear diophantine equations', 8.5),
# ('minimal supporting set', 7.666666666666666),
# ('minimal set', 4.666666666666666),
# ('linear constraints', 4.5),
# ('natural numbers', 4.0),
# ('strict inequations', 4.0),
# ('nonstrict inequations', 4.0),
# ('upper bounds', 4.0),
# ('mixed types', 3.666666666666667)
Upvotes: 5
Reputation: 3828
Check out RAKE (Rapid Automatic Keyword Extraction): it's quite a nice little Python library.
EDIT: I've also found a tutorial on how to get started with it.
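As an illustration, here is a minimal sketch using the rake-nltk package (one of several RAKE implementations; which one the answer links to isn't specified here, so take the package name as an assumption):

# pip install rake-nltk  (also needs nltk.download('stopwords') and nltk.download('punkt'))
from rake_nltk import Rake

rake = Rake()  # defaults to NLTK's English stopword list
rake.extract_keywords_from_text(
    'Compatibility of systems of linear constraints over the set of '
    'natural numbers.'
)
print(rake.get_ranked_phrases_with_scores())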
Upvotes: 12