Keshav Kumar

Reputation: 11

Extracting keywords from documents based on a fixed list of keywords / phrases

I have a list of approximately 100 keywords, and I need to search for them in a large corpus of over 100,000 documents.

I don't want an exact match. For example, if the keyword is "Growth Fund", I expect all matches like "growth funds", "growth fund of america", etc.

Any suggestions for this?

I have tried using spaCy's PhraseMatcher, but it raises ValueError: [T001] Max length currently 10 for phrase matching.

import spacy
from spacy.matcher import PhraseMatcher

# Placeholder: in practice this is a Python list of 100+ keyword phrases
full_funds_list_flat = ["<list of 100+ Keywords>"]


nlp = spacy.load('en_core_web_sm')
keyword_patterns = [nlp(text) for text in full_funds_list_flat]
matcher = PhraseMatcher(nlp.vocab)
matcher.add('KEYWORD', None, *keyword_patterns)
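As a point of comparison, here is a minimal standard-library sketch (independent of spaCy) that catches simple inflected variants with a case-insensitive regex; the keyword list and document are illustrative, not from the real data:

```python
import re

# Illustrative keywords; \w* after each phrase allows suffixes such as "funds"
keywords = ["growth fund", "index fund"]
pattern = re.compile(
    r"\b(" + "|".join(re.escape(k) + r"\w*" for k in keywords) + r")",
    re.IGNORECASE,
)

doc = "The Growth Funds of America outperformed several index funds."
matches = pattern.findall(doc)
# matches -> ['Growth Funds', 'index funds']
```

This will not handle reordered words or irregular plurals, but for a fixed list of 100 keywords it is a cheap first pass.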

Upvotes: 1

Views: 1792

Answers (3)

Evan Mata

Reputation: 612

There are multiple options. I would recommend first lemmatizing your corpus. I don't know how many named entities you'll need to work with; lemmatization will not help there, so you may want a specific approach for them (as someone else mentioned, `a in b` could help, or you could add them to spaCy as individual cases). A second option is to use a word2vec (or other text embedding) model and check the k most similar words to the ones you want to avoid repeats of, and use that to decide which extra cases you need to handle. One other quick option, for finding candidate phrases in the first place, is to load a pretrained model (gensim has some) and extract any phrases/words not in its vocabulary; this will likely get you a lot of the named entities, so you know which cases you'll have to consider.
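A toy illustration of the lemmatization idea: normalize both the corpus and the keywords to lemmas before matching, so inflected forms collapse to the same string. The lemma map below is a hypothetical stand-in; in practice you would use a real lemmatizer such as spaCy's or NLTK's:

```python
# Hypothetical lemma map standing in for a real lemmatizer
LEMMAS = {"funds": "fund", "growing": "grow", "matches": "match"}

def lemmatize(text):
    # Lowercase, split on whitespace, and map each token to its lemma
    return " ".join(LEMMAS.get(tok, tok) for tok in text.lower().split())

doc = "Growth Funds of America"
keyword = "growth fund"
assert keyword in lemmatize(doc)  # lemmatized doc: "growth fund of america"
```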

Upvotes: 0

Naitik Chandak

Reputation: 130

I'd recommend using the fuzzywuzzy library for Python, since you don't need an exact match. It uses the Levenshtein distance algorithm, which is well suited to matching phrases approximately.

reference link - https://github.com/seatgeek/fuzzywuzzy
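If you want to try the idea without installing anything, the standard library's difflib computes a comparable similarity score (fuzzywuzzy itself falls back to difflib's SequenceMatcher when python-Levenshtein is not installed); the function name and threshold below are illustrative:

```python
from difflib import SequenceMatcher

def ratio(a, b):
    # Similarity score from 0 to 100, comparable in spirit to fuzz.ratio
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())

close = ratio("growth fund", "growth funds")   # near-duplicate, high score
far = ratio("growth fund", "value stocks")     # unrelated, low score
```

You would then keep any document phrase whose score against a keyword clears some threshold (e.g. 80), tuned on a sample of your corpus.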

Upvotes: 0

Born Tbe Wasted

Reputation: 610

I'm currently working on something quite similar. There are multiple options; here is a quick selection:

  • Iterate using `a in b`. Although quite simple, this is extremely powerful. It's not ideal, but if this is a one-time check for those keywords, it will find most partial matches (when the plural just adds an "s", "match" in "matches" == True).

  • Store your corpus in PostgreSQL and use its built-in full-text search, which is quite strong. This is heavier, but it will help if you need to run the keywords multiple times, since you do the transformation only once. See: https://www.compose.com/articles/mastering-postgresql-tools-full-text-search-and-phrase-search/
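A minimal sketch of the first option, with illustrative keywords and documents (lowercasing both sides so case differences don't block a match):

```python
# Substring scan over a corpus: simple, and catches plural variants
# because the singular keyword is a prefix of the plural form.
keywords = ["growth fund", "index fund"]            # illustrative
documents = [
    "Growth funds rallied this quarter.",
    "Bond yields fell again.",
]

hits = {
    kw: [doc for doc in documents if kw in doc.lower()]
    for kw in keywords
}
# hits["growth fund"] contains the first document; hits["index fund"] is empty
```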

As I am not an expert, I am open to any insight, and I know this might not be the best answer. But at least it gives you something to go on.

Upvotes: 1
