Reputation: 591

Train Spacy on unlabeled text corpus to extract "important phrases"

I'm looking for a way to extract "important phrases" from text documents. I was hoping to do this with spaCy, but there is one caveat: my data consists mostly of product information, so the important phrases differ from what they would be in natural spoken language. For this reason, I would like to train spaCy on my own corpus, but the only information I can find covers training spaCy on labeled data.

Does anyone know if what I want to do is possible?

Upvotes: 4

Answers (2)

Branden Ciranni

Reputation: 492

If you are looking for a scheme to weight phrases by importance without any labeled data, you can try TF-IDF.

For this answer, I will use "term" to mean either a word or a phrase: a single unit of text.

A Brief Look at TF-IDF

TF-IDF stands for (Term Frequency) x (Inverse Document Frequency).
It weighs how often a term appears in a single document against how many documents in the corpus contain that term, so terms that are frequent in one document but rare elsewhere score highest.
It is commonly used as a statistical measure of how important a term is to a document within a corpus.
For a longer, but readable explanation of it, check out the wiki: https://en.wikipedia.org/wiki/Tf%E2%80%93idf.
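
To make the idea concrete, here is a minimal sketch that computes a plain TF-IDF by hand. It assumes the textbook definition (tf = count / document length, idf = log(N / df)); the toy product documents and the `tf_idf` function name are made up for illustration. Scikit-Learn's TfidfVectorizer uses a smoothed and normalized variant, so its numbers will differ.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per document.

    Assumes the textbook definition: tf = count / len(doc), idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical, already-tokenized product descriptions.
docs = [
    ["usb", "cable", "black"],
    ["usb", "charger", "white"],
    ["usb", "cable", "braided"],
]
scores = tf_idf(docs)
```

In this toy corpus "usb" occurs in every document, so its idf is log(1) and its score is zero, while a term that occurs in only one document, such as "black", gets the highest weight in that document.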

Code Implementation

Check out Scikit-Learn's TfidfVectorizer.
- This has a fit_transform method that takes raw text as input and outputs the TF-IDF weights for words and/or n-grams.
- If you prefer to do your own tokenization with spaCy, or only include doc.noun_chunks and doc.ents that satisfy len(span) >= 2 (i.e. phrases), there is a little hack for the TfidfVectorizer.
- To use your own tokenization, do the following:
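```
from sklearn.feature_extraction.text import TfidfVectorizer

# Identity analyzer: pass each pre-tokenized document through unchanged.
dummy = lambda x: x

# list_of_tokenized_docs: one list of tokens (or phrases) per document.
vectorizer = TfidfVectorizer(analyzer=dummy)
tfidf = vectorizer.fit_transform(list_of_tokenized_docs)
```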
  This overrides the default tokenization and lets you use your own list of tokens.

From there you can find the terms that have the highest average TF-IDF score across all documents, and treat those as important. You can try using those as input to the PhraseMatcher: https://spacy.io/usage/rule-based-matching#phrasematcher.

Or you can use these terms to label documents automatically. If you can locate them in your documents after determining that they are important, you can attach an appropriate label to each occurrence and use the result as training data for a training pipeline.
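
A rough sketch of that auto-labeling step, assuming spaCy v3 (where `matcher.add` takes a list of pattern docs). The phrase list, the `IMPORTANT_PHRASE` label, and the `auto_label` helper are all hypothetical names chosen for illustration, and how you feed the resulting annotations into training depends on your spaCy version.

```python
import spacy
from spacy.matcher import PhraseMatcher

# A blank English pipeline is enough here: PhraseMatcher only needs tokenization.
nlp = spacy.blank("en")

# Hypothetical phrases, e.g. picked from the top of a TF-IDF ranking.
important_phrases = ["fast charging", "braided nylon"]

matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # compare lowercased token text
matcher.add("IMPORTANT_PHRASE", [nlp.make_doc(p) for p in important_phrases])

def auto_label(text):
    """Return (text, {"entities": [(start_char, end_char, label), ...]})."""
    doc = nlp.make_doc(text)
    entities = []
    for match_id, start, end in matcher(doc):
        span = doc[start:end]
        # Convert token offsets to character offsets and recover the label string.
        entities.append((span.start_char, span.end_char, nlp.vocab.strings[match_id]))
    return text, {"entities": entities}

example = auto_label("Braided nylon cable with fast charging support.")
```

Each returned tuple pairs the raw text with character-offset annotations, which is the shape commonly used for spaCy NER training data.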

Upvotes: 1

Faizan Naseer

Reputation: 627

If you want exact phrases to be recognised, you can compile a list of those phrases and add them to spaCy's PhraseMatcher, which will then recognise them in new text.

https://spacy.io/usage/rule-based-matching#phrasematcher

The only limitation is that it will recognise only the exact phrases supplied to it. This contrasts with NER, which can recognise additional phrases based on the training it was given; the PhraseMatcher recognises only the ones you provide.
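
To illustrate that limitation, here is a small sketch, again assuming spaCy v3 and using a made-up phrase and sentences. With the default settings the matcher compares exact token text, so a variant with an extra word or different casing is not found.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # tokenizer only; no trained components needed
matcher = PhraseMatcher(nlp.vocab)  # default: compares exact token text
matcher.add("PHRASE", [nlp.make_doc("fast charging")])

def count_matches(text):
    """Number of PhraseMatcher hits in the text."""
    return len(matcher(nlp.make_doc(text)))

exact = count_matches("This cable supports fast charging.")
variant = count_matches("This cable supports fast wireless charging.")
recased = count_matches("This cable supports Fast Charging.")
```

Only the first sentence produces a match. Passing `attr="LOWER"` when creating the matcher would make the comparison case-insensitive, but the inserted word in the second sentence would still prevent a match.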

Upvotes: 0
