Aspect Extraction on spaCy

Question

[Disclaimer: I also posted this question to spaCy's GitHub Discussions]

I’m trying to avoid some common pitfalls, so I’m looking for guidance on the best approach to aspect extraction against a given taxonomy in spaCy. I’ve considered three approaches:

1) NER + DependencyMatcher

Entity Recognition: Make sure to have a rich set of entities for each taxonomy category. For example, for the SERVICE category, entities such as: chef, cook, waiter, waitress, waitstaff, server, bartender, barman, barwoman, etc.

Dependency Pattern Matching: Create multiple DependencyMatcher patterns to capture phrases anchored on those entities, such as:

“The waiter was very rude”
“Bartender is kinda obnoxious”

Example Pattern:

"ADJ_AUX_SERVICE": [
    {"RIGHT_ID": "adj", "RIGHT_ATTRS": {"POS": "ADJ", "DEP": "acomp"}},
    {"LEFT_ID": "adj", "REL_OP": "<", "RIGHT_ID": "aux", "RIGHT_ATTRS": {"POS": "AUX"}},
    {"LEFT_ID": "aux", "REL_OP": ">", "RIGHT_ID": "service", "RIGHT_ATTRS": {"ENT_TYPE": "SERVICE", "DEP": "nsubj"}}
],

I can also expand the matched tokens to include the subtrees they belong to or add modifiers (like an "advmod" on the adjective) to capture intensifiers such as “very” or “kinda.”
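To make the pattern above concrete, here is a minimal sketch of how I would expect it to run. It builds the parse for “The waiter was very rude” by hand, so the POS tags, heads, and dependency labels are my assumption of what a parser would produce, and the SERVICE entity is set manually instead of coming from a trained NER. The advmod lookup at the end is just one possible way to pick up intensifiers.

```python
import spacy
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc, Span

nlp = spacy.blank("en")  # blank pipeline: no trained model is needed here

# Hand-annotated parse (assumed, not produced by a real parser).
words = ["The", "waiter", "was", "very", "rude"]
pos = ["DET", "NOUN", "AUX", "ADV", "ADJ"]
heads = [1, 2, 2, 4, 2]  # absolute index of each token's head
deps = ["det", "nsubj", "ROOT", "advmod", "acomp"]
doc = Doc(nlp.vocab, words=words, pos=pos, heads=heads, deps=deps)

# Stand-in for the NER step: mark "waiter" as a SERVICE entity.
doc.ents = [Span(doc, 1, 2, label="SERVICE")]

pattern = [
    # Anchor: an adjectival complement ("rude").
    {"RIGHT_ID": "adj", "RIGHT_ATTRS": {"POS": "ADJ", "DEP": "acomp"}},
    # The adjective is an immediate dependent of an auxiliary ("was").
    {"LEFT_ID": "adj", "REL_OP": "<", "RIGHT_ID": "aux",
     "RIGHT_ATTRS": {"POS": "AUX"}},
    # The auxiliary is the immediate head of a SERVICE subject ("waiter").
    {"LEFT_ID": "aux", "REL_OP": ">", "RIGHT_ID": "service",
     "RIGHT_ATTRS": {"ENT_TYPE": "SERVICE", "DEP": "nsubj"}},
]

matcher = DependencyMatcher(nlp.vocab)
matcher.add("ADJ_AUX_SERVICE", [pattern])

matches = matcher(doc)
match_id, token_ids = matches[0]  # token_ids follow the pattern order

adj = doc[token_ids[0]]
service = doc[token_ids[2]]
# Collect intensifiers attached to the adjective, e.g. "very" or "kinda".
intensifiers = [t.text for t in adj.children if t.dep_ == "advmod"]

print(nlp.vocab.strings[match_id], service.text, intensifiers, adj.text)
```

With a real pipeline, the Doc would come from nlp(text) and the SERVICE entities from the NER component or an EntityRuler; the matcher code itself should not need to change.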

(Optional) Classification: Use a multiclass TextCat on the extracted phrases to assign a final category label (or none), and use that to filter out extractions that are not useful.

Sentiment Scoring: Run a separate sentiment classification model on the matched phrases.

2) TextCat/SpanCat

Data Preparation: Keep documents short (from a sentence up to a few sentences) and train a multilabel TextCat or SpanCat model based on the available annotations.

For this approach, I do have some extra questions:

I’m currently using the en_core_web_trf pipeline, which includes a Tagger, Parser, Lemmatizer, and NER all sharing the same transformer.

If I add a TextCat or SpanCat, should I:

Train it using a non-transformer model?
Or use a transformer?

If I opt for the transformer approach, should I:

Use replace_listeners so that the new component gets its own copy of the transformer (at the expense of a heftier pipeline)?
Or add it as a listener to the existing transformer?

If added as a listener, my understanding is that there are two main options:

Freeze the other components (fine-tuning only the last head, which would hurt performance of all the frozen components)
Train everything together.

Since my training datasets lack Tag, Dep, Lemma, and NER annotations, is it possible to generate these automatically (via en_core_web_trf) during the SpanCat/TextCat Prodigy annotation process so that my dataset would have everything? And if so, is this desirable anyway?

Sentiment Scoring:

I could add "Positive", "Negative", and "Neutral" to the multilabel set, effectively combining the aspect classification and sentiment classification labels into one.
Or I could add a second TextCat after this TextCat/SpanCat that predicts only the sentiment labels for the aspect-labeled spans.
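To illustrate the first option, here is a rough sketch of what the combined label set and a multilabel annotation could look like. The aspect and sentiment names, the underscore naming scheme, and the to_cats helper are all hypothetical; the only thing taken from spaCy is the convention that text classification annotations are a dict mapping each label to a score.

```python
from itertools import product

# Hypothetical label sets; the real ones would come from the taxonomy.
ASPECTS = ["SERVICE", "FOOD"]
SENTIMENTS = ["POSITIVE", "NEGATIVE", "NEUTRAL"]

# One combined label per (aspect, sentiment) pair.
COMBINED_LABELS = [f"{a}_{s}" for a, s in product(ASPECTS, SENTIMENTS)]


def to_cats(gold_pairs):
    """Turn gold (aspect, sentiment) pairs into a label -> score dict."""
    active = {f"{a}_{s}" for a, s in gold_pairs}
    return {label: 1.0 if label in active else 0.0 for label in COMBINED_LABELS}


text = ("Restaurant ABC has very rude staff but their pizzas "
        "are an out of body experience")
cats = to_cats([("SERVICE", "NEGATIVE"), ("FOOD", "POSITIVE")])

print(len(COMBINED_LABELS))
print([label for label, score in cats.items() if score])
```

One thing this makes visible is that the combined label count grows as aspects times sentiments, so each label would see fewer training examples than it would with two separate classifiers.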

3) NER/SpanCat + Relation

If I'm reading this correctly, it looks like this could be a viable approach if the NER is treated as a SpanCat for the "aspect", with the relation label allowing for more nuanced extraction. Continuing my example above and making it more complex:

                 /----------------------------------FOOD QUALITY----------------------v      
                /--------SERVICE QUALITY--------v                                     v
"Restaurant [ABC:ENTITY] has [very rude staff:ASPECT] but their [pizzas are an out of body experience:ASPECT]"

Sentiment Scoring: Follow this model with a TextCat for sentiment classification on the aspect-labeled entities/spans.
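For clarity, the annotation in the diagram above can be written out as plain data. This is only my reading of it: the LabeledSpan and Relation classes and the find_span helper are hypothetical, the offsets are character-based for simplicity, and I am assuming the relations point from the entity to each aspect span.

```python
from dataclasses import dataclass

text = ("Restaurant ABC has very rude staff but their pizzas "
        "are an out of body experience")


@dataclass(frozen=True)
class LabeledSpan:
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    label: str


@dataclass(frozen=True)
class Relation:
    head: LabeledSpan
    child: LabeledSpan
    label: str


def find_span(phrase, label):
    """Locate the first occurrence of phrase in text and label it."""
    start = text.index(phrase)
    return LabeledSpan(start, start + len(phrase), label)


entity = find_span("ABC", "ENTITY")
staff = find_span("very rude staff", "ASPECT")
pizzas = find_span("pizzas are an out of body experience", "ASPECT")

# Assumed direction: entity -> aspect, labeled with the taxonomy category.
relations = [
    Relation(entity, staff, "SERVICE QUALITY"),
    Relation(entity, pizzas, "FOOD QUALITY"),
]

for rel in relations:
    head = text[rel.head.start:rel.head.end]
    child = text[rel.child.start:rel.child.end]
    print(f"{head} -[{rel.label}]-> {child}")
```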
