Reputation: 591
I'm looking for a way to extract "important phrases" from text documents. I was hoping to do this using spaCy, but there is one caveat: my data contains mostly product information, so the important phrases are different from what they would be in natural spoken language. For this reason, I would like to train spaCy on my own corpus, but the only info I can find is for training spaCy using labeled data.
Does anyone know if what I want to do is possible?
Upvotes: 4
Views: 659
Reputation: 492
If you are looking for a scheme to weight phrases according to "Importance" without any labeled data, you can try using TF-IDF.
For this answer, I will refer to terms; these can be words or phrases. A term simply represents a single unit of text.
A Brief Look at TF-IDF
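As a rough sketch of the idea: a term's weight goes up the more often it occurs in a document, and goes down the more documents it appears in. The snippet below uses one textbook variant (tf = count / document length, idf = log(N / document frequency)). The function name tf_idf and the sample documents are made up for illustration, and library implementations such as scikit-learn's use different smoothing and normalisation, so the numbers will not match theirs.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    docs is a list of token lists. Uses tf = count / doc length and
    idf = log(N / document frequency), one textbook variant among several.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical, already tokenized product descriptions.
docs = [
    ["usb", "cable", "braided", "nylon"],
    ["usb", "charger", "fast", "charging"],
    ["usb", "hub", "aluminium", "case"],
]
weights = tf_idf(docs)
print(weights[0])
```

A term that occurs in every document ("usb" here) gets weight zero under this variant, which is exactly why TF-IDF is useful for domain-specific corpora: words that are everywhere in your product data stop looking important.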
Code Implementation
scikit-learn's TfidfVectorizer has a fit_transform function that takes raw text as input and outputs the appropriate TF-IDF weights for words and/or n-grams.
If you prefer to do your own tokenization with spaCy, or only include doc.noun_chunks and doc.ents that satisfy len(span) >= 2 (i.e. phrases), there is a little hack for the TfidfVectorizer.
To use your own tokenization, do the following:
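from sklearn.feature_extraction.text import TfidfVectorizer
# Identity analyzer: pass each document's token list through unchanged.
dummy = lambda x: x
vectorizer = TfidfVectorizer(analyzer=dummy)
# list_of_tokenized_docs: one list of tokens/phrases per document.
tfidf = vectorizer.fit_transform(list_of_tokenized_docs)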
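Passing a callable as the analyzer bypasses the vectorizer's own preprocessing, tokenization and n-gram generation, so each document is scored using exactly the list of tokens you supply.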
From there you can find the terms that have the highest average TF-IDF score across all documents, and consider those as Important. You can try using those as input to the PhraseMatcher: https://spacy.io/usage/rule-based-matching#phrasematcher.
Or you can find some way to use these to automatically label documents. If you can locate them in your documents after determining they are important, you can then add an appropriate label and use that as training data to some training pipeline.
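Here is a rough sketch of that automatic labeling idea, assuming spaCy v3 (where matcher.add takes a list of pattern docs). The phrase list, the label name IMPORTANT_PHRASE and the helper auto_label are all made up for illustration. The (text, {"entities": [...]}) shape is just one common way to hold character-offset annotations; you would still need to convert it to whatever your training pipeline expects.

```python
import spacy
from spacy.matcher import PhraseMatcher

# A blank English pipeline is enough here: PhraseMatcher only needs a tokenizer.
nlp = spacy.blank("en")

# Hypothetical phrases, e.g. the top-ranked TF-IDF terms.
important_terms = ["usb cable", "fast charging"]

# attr="LOWER" makes matching case-insensitive.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("IMPORTANT_PHRASE", [nlp.make_doc(term) for term in important_terms])

def auto_label(text, label="IMPORTANT_PHRASE"):
    """Return (text, {"entities": [(start_char, end_char, label), ...]})."""
    doc = nlp.make_doc(text)
    entities = []
    for _match_id, start, end in matcher(doc):
        span = doc[start:end]
        entities.append((span.start_char, span.end_char, label))
    entities.sort()  # keep annotations in document order
    return text, {"entities": entities}

example = auto_label("This USB cable supports fast charging.")
print(example)
```

Labels produced this way are only as good as the phrase list, so it is worth spot-checking a sample before training on them.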
Upvotes: 1
Reputation: 627
If you want exact phrases to be recognised, you can compile a list of those phrases and use spaCy's PhraseMatcher to recognise them later. No training is involved: the matcher is rule-based and simply looks up the phrases you add to it.
https://spacy.io/usage/rule-based-matching#phrasematcher
The only catch is that it will only recognise the exact phrases supplied to it. This is in contrast to how NER works: NER can recognise additional phrases based on the training provided, but PhraseMatcher will only recognise the ones you give it.
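To illustrate the exact-match behaviour, here is a small sketch assuming spaCy v3. The label PRODUCT_PHRASE, the helper find_phrases and the sample sentences are invented for the example.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # tokenizer only; no trained model required
matcher = PhraseMatcher(nlp.vocab)  # default: match on exact token text
matcher.add("PRODUCT_PHRASE", [nlp.make_doc("USB cable")])

def find_phrases(text):
    """Return the text of every supplied phrase found in `text`."""
    doc = nlp.make_doc(text)
    return [doc[start:end].text for _match_id, start, end in matcher(doc)]

print(find_phrases("A braided USB cable"))    # the supplied phrase is found
print(find_phrases("A braided USB-C cable"))  # a variant is not
```

By default the match is also case-sensitive; constructing the matcher with attr="LOWER" relaxes that, but it still will not generalise to phrases you never supplied.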
Upvotes: 0