Steve

Reputation: 505

Identifying the subject of a sentence

I have been exploring NLP techniques with the goal of identifying the subject of survey comments (which I then use in conjunction with sentiment analysis). I want to make high level statements such as "10% of survey respondents made a positive comment (+ sentiment) about Account Managers".

My approach has used Named Entity Recognition (NER). Now that I am working with real data, I am seeing some of the complexities and nuances associated with identifying the subject of a sentence. Here are six examples of sentences where the subject is the Account Manager. I have put the named entity in bold for demonstration purposes.

  1. Our account manager is great, he always goes the extra mile!
  2. Steve our account manager is great, he always goes the extra mile!
  3. Steve our relationship manager is great, he always goes the extra mile!
  4. Steven is great, he always goes the extra mile!
  5. Steve Smith is great, he always goes the extra mile!
  6. Our business mgr. is great, he always goes the extra mile!

I see three challenges that add complexity to my task:

  1. Synonyms: Account manager vs relationship manager vs business mgr. This is somewhat domain specific and tends to vary with the survey target audience.
  2. Abbreviations: Mgr. vs manager
  3. Ambiguity: whether “Steven” is “Steve Smith” and therefore an “account manager”.

Of these, the synonym problem is the most frequent, followed by the ambiguity problem. Based on what I have seen, abbreviations are not that frequent in my data.

Are there any NLP techniques that can help deal with any of these issues to a relatively high degree of confidence?

Upvotes: 5

Views: 1334

Answers (3)

Tobias

Reputation: 424

As far as I understand, what you call the "subject" of a sentence is the entity that a statement is made about - in your example, Steve the account manager.

Based on this assumption, here are a few techniques and how they might help you:

(Dependency) Parsing

Since you don't mean subject in the strict grammatical sense, the approach suggested by user7344209 based on dependency parsing probably won't help you. In a sentence such as "I like Steve", the grammatical subject is "I", although you probably want to find "Steve" as the "subject".

Named Entity Recognition

You already use this, and it works well for detecting names of persons such as Steve. What I'm not so sure about is the "account manager" example. Neither the output provided by Daniel nor my own test with Stanford CoreNLP identified it as a named entity - which is correct, since it really is not a named entity:

[Image: Stanford CoreNLP NER output for the example sentences]

Something broader, such as the suggested mention identification, might be better, but it basically marks every noun phrase, which is probably too broad. If I understood correctly, you want to find one subject per sentence.

Coreference Resolution

Coreference resolution is the key technique for detecting that "Steve" and the "account manager" are the same entity. Stanford CoreNLP, for example, has such a module.

For this to work in your example, you have to let it process several sentences at once, since you want to find the links between them. Here is an example with shortened versions of some of your sentences:

[Image: CoreNLP coreference visualization for the shortened example sentences]

The visualization is a bit messy, but it basically found the following coreference chains:

  • Steve <-> Steve Smith
  • Steve our account manager <-> He <-> Our account manager
  • Our <-> Our
  • the extra mile <-> the extra mile

Given the first two chains, and a bit of post-processing, you could figure out that all four statements are about the same entity.
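The post-processing could look roughly like the sketch below. It is not part of CoreNLP: `merge_chains`, `content_tokens` and the `STOPWORDS` list are hypothetical names made up for illustration, and the rule (merge two chains when they share a non-stopword token such as "Steve") is a crude heuristic that would need tuning on real data.

```python
# Hypothetical stoplist: tokens that should never link two chains together.
STOPWORDS = {"our", "he", "she", "they", "his", "her", "the", "a", "an"}


def content_tokens(chain):
    """Lowercased tokens of all mentions in a chain, minus stopwords."""
    tokens = {tok.lower() for mention in chain for tok in mention.split()}
    return tokens - STOPWORDS


def merge_chains(chains):
    """Merge coreference chains that share at least one content token."""
    groups = []  # list of (token_set, mentions) pairs
    for chain in chains:
        tokens = content_tokens(chain)
        mentions = list(chain)
        remaining = []
        for group_tokens, group_mentions in groups:
            if tokens & group_tokens:
                # Shared content token: absorb the existing group.
                tokens |= group_tokens
                mentions = group_mentions + mentions
            else:
                remaining.append((group_tokens, group_mentions))
        remaining.append((tokens, mentions))
        groups = remaining
    return [mentions for _, mentions in groups]


# The four chains listed above, written out by hand.
chains = [
    ["Steve", "Steve Smith"],
    ["Steve our account manager", "He", "Our account manager"],
    ["Our", "Our"],
    ["the extra mile", "the extra mile"],
]
merged = merge_chains(chains)
```

With these inputs the first two chains end up in one group because both contain "Steve", while "Our" and "the extra mile" stay separate.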

Semantic Similarity

In the case of account, business and relationship manager, I found that the CoreNLP coreference resolver actually already finds chains despite the different terms.

More generally, if you think that the coreference resolver cannot handle synonyms and paraphrases well enough, you could also try to include measures of semantic similarity. There is a lot of work in NLP on predicting whether two phrases are synonymous or not.

Some approaches are:

  • Looking up synonyms in a thesaurus such as WordNet - e.g. with NLTK (Python)
  • Better, computing a similarity measure based on the relationships defined in WordNet - e.g. using SEMILAR (Java)
  • Using continuous representations for words to compute similarities, for example based on LSA or LDA - also possible with SEMILAR
  • Using more recent neural-network-style word embeddings such as word2vec or GloVe - the latter are easily usable with spaCy (Python)

One way to use these similarity measures would be to identify the entities in two sentences, compare them pairwise across the two sentences, and treat a pair as the same entity if its similarity is above a threshold.
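A minimal sketch of that pairwise comparison, assuming each entity phrase already has a vector. The vectors in `TOY_VECTORS` are invented for illustration only (real ones would come from word2vec, GloVe or a similar model), and `match_entities` and the threshold of 0.9 are likewise assumptions, not a recommended setting.

```python
import math
from itertools import product

# Hypothetical toy vectors, made up for this example.
TOY_VECTORS = {
    "account manager": [0.9, 0.8, 0.1],
    "relationship manager": [0.8, 0.9, 0.2],
    "invoice": [0.1, 0.2, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_entities(entities_a, entities_b, vectors, threshold=0.9):
    """Return (a, b, score) for every cross-sentence pair above the threshold."""
    matches = []
    for a, b in product(entities_a, entities_b):
        score = cosine(vectors[a], vectors[b])
        if score >= threshold:
            matches.append((a, b, score))
    return matches


matches = match_entities(
    ["account manager"],
    ["relationship manager", "invoice"],
    TOY_VECTORS,
)
```

With these toy values only the pair "account manager" / "relationship manager" passes the threshold. In practice the threshold would have to be chosen by checking matches against a hand-labelled sample of your survey comments.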

Upvotes: 2

Daniel

Reputation: 6039

I like your approach using NER. This is what I see in our system for your inputs: [Image: NER output]

Mention-detection output might also be useful: [Image: mention-detection output]

On your 2nd point, which involves abbreviations: it is a hard problem, but we have an entity-similarity module that might be useful. It takes into account things like honorifics.
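Independently of that module, a simple rule-based fallback for the most common abbreviations might look like the sketch below. The `ABBREVIATIONS` table and `expand_abbreviations` are hypothetical names; the table is hand-maintained and would need to be filled from your own data.

```python
import re

# Hypothetical, hand-maintained abbreviation table (lowercase keys).
ABBREVIATIONS = {"mgr": "manager", "acct": "account"}

# Match any known abbreviation as a whole word, with an optional trailing dot.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\b\.?",
    re.IGNORECASE,
)


def expand_abbreviations(text):
    """Replace known abbreviations with their long form."""
    return _PATTERN.sub(lambda m: ABBREVIATIONS[m.group(1).lower()], text)


expanded = expand_abbreviations("Our business mgr. is great")
# expanded == "Our business manager is great"
```

Running this before NER or coreference means "mgr." and "manager" are seen as the same token downstream.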

About your 3rd point, the co-reference problem, try the coref module: [Image: coreference output]

Btw the above figures are from the demo here: http://deagol.cs.illinois.edu:8080

Upvotes: 1

Raymond Chen

Reputation: 429

If you don't have much training data, you could try a dependency analysis tool and extract the dependency pairs in which a subject is identified (usually the nsubj relation if you use the Stanford Parser).
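A rough sketch of the extraction step, assuming the parser output has already been converted to (head index, relation, dependent index) triples. The parse in `TRIPLES` is written by hand for illustration and the relation labels other than nsubj are an assumption, since they differ between parser versions; `subject_phrases` is a hypothetical helper.

```python
# Hand-written, hypothetical parse of "Our account manager is great".
# A real parser would produce these triples; label names may differ.
TOKENS = ["Our", "account", "manager", "is", "great"]
TRIPLES = [
    (2, "nmod:poss", 0),  # manager -> Our
    (2, "compound", 1),   # manager -> account
    (4, "nsubj", 2),      # great -> manager
    (4, "cop", 3),        # great -> is
]


def subject_phrases(tokens, triples, keep=("compound",)):
    """Return each nsubj token together with its selected modifiers."""
    phrases = []
    for head, rel, dep in triples:
        if rel != "nsubj":
            continue
        # The subject token plus any direct dependents with a kept relation.
        indices = {dep} | {d for h, r, d in triples if h == dep and r in keep}
        phrases.append(" ".join(tokens[i] for i in sorted(indices)))
    return phrases


subjects = subject_phrases(TOKENS, TRIPLES)
# subjects == ["account manager"]
```

Keeping only compound modifiers drops "Our" and leaves the role name, which is easier to count across comments.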

Upvotes: 1
