Python Named Entities Recognition finding a specific entity

Question

I am currently working on an NLP project and am trying to use NLTK to recognize a PERSON name. However, the problem is more challenging than just part-of-speech tagging.

input = "Hello world, the case is complex. John Due, the plaintiff in the case has hired attorney John Smith for the case."

So the challenge is this: from the whole document I only want the attorney's name returned, not the other person's. That is, "John Smith", type: PERSON, occupation: attorney. The return value could look like this, or just be "John Smith".

{ 
 "name": "John Smith",
 "type": "PERSON",
 "occupation": "attorney"
}

I have tried NLTK part-of-speech tagging and also the Google Cloud Natural Language API, but they only helped me detect the PERSON name. How can I detect whether that person is an attorney? Please point me to the right approach. Do I have to train a model on my own data or corpus to detect "attorney"? I have thousands of court documents as txt files.

dsesto · Accepted Answer

The trouble with pre-trained machine learning models is that they leave little room for flexibility in what you want to achieve. Tools such as Google Cloud Natural Language offer some really interesting functionality, but you cannot make them do work they were not built for. In such a case, you would need to train your own models, or try a different approach with tools such as TensorFlow, which require considerable expertise to obtain decent results.

However, regarding your precise use case, you can use the analyzeEntities method to find named entities (common nouns and proper names). It turns out that, if the word attorney is next to the name of the person who is actually an attorney (as in "I am John, and my attorney James is working on my case." or your example "Hello world, the case is complex. John Due, the plaintiff in the case has hired attorney John Smith for the case."), it will bind those two entities together.
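As a rough illustration, a call to analyzeEntities over REST could be built as follows. This is a minimal sketch using only the standard library; it assumes the v1 endpoint URL shown and API-key authentication through a `key` query parameter, and the helper names `build_request` and `analyze_entities` are hypothetical, not part of any Google client library.

```python
import json
import urllib.request

# Assumed REST endpoint for the analyzeEntities method (v1).
ENDPOINT = "https://language.googleapis.com/v1/documents:analyzeEntities"


def build_request(text: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for analyzeEntities."""
    body = {
        "document": {"content": text, "type": "PLAIN_TEXT"},
        "encodingType": "UTF8",
    }
    return urllib.request.Request(
        f"{ENDPOINT}?key={api_key}",  # assumption: API-key authentication
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze_entities(text: str, api_key: str) -> list:
    """Send the request and return the list of entities from the response."""
    with urllib.request.urlopen(build_request(text, api_key)) as resp:
        return json.load(resp).get("entities", [])
```

Calling `analyze_entities(text, "YOUR_API_KEY")` needs network access and a valid key; `build_request` alone can be inspected offline.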

You can test this using the API Explorer. For the following request:

{
 "document": {
  "content": "I am John, and my attorney James is working on my case.",
  "type": "PLAIN_TEXT"
 },
 "encodingType": "UTF8"
}

Some of the resulting entities are:

{
 "name": "James",
 "type": "PERSON",
 "metadata": {},
 "salience": 0.5714066,
 "mentions": [
  {
   "text": {
    "content": "attorney",
    "beginOffset": 18
   },
   "type": "COMMON"
  },
  {
   "text": {
    "content": "James",
    "beginOffset": 27
   },
   "type": "PROPER"
  }
 ]
},
{
 "name": "John",
 "type": "PERSON",
 "metadata": {},
 "salience": 0.23953272,
 "mentions": [
  {
   "text": {
    "content": "John",
    "beginOffset": 5
   },
   "type": "PROPER"
  }
 ]
}

In this case, you can parse the JSON response and see that James is (correctly) connected to the attorney noun, while John is not. However, in the tests I have run, this behavior only seems to be reproducible when the word attorney is directly next to one of the names you are trying to identify.
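The parsing step could look something like the sketch below. It assumes the response has already been decoded into Python dictionaries shaped like the entities shown above; the function name `people_with_role` is made up for illustration, and the output format simply mirrors the one requested in the question.

```python
from __future__ import annotations

from typing import Any


def people_with_role(entities: list[dict[str, Any]], role: str) -> list[dict[str, str]]:
    """Return PERSON entities that have a COMMON mention equal to `role`."""
    matches = []
    for entity in entities:
        if entity.get("type") != "PERSON":
            continue
        # Collect the common-noun mentions the API bound to this person.
        common_mentions = {
            mention["text"]["content"].lower()
            for mention in entity.get("mentions", [])
            if mention.get("type") == "COMMON"
        }
        if role.lower() in common_mentions:
            matches.append(
                {"name": entity["name"], "type": "PERSON", "occupation": role}
            )
    return matches


# Trimmed copy of the entities from the response above.
entities = [
    {"name": "James", "type": "PERSON", "mentions": [
        {"text": {"content": "attorney", "beginOffset": 18}, "type": "COMMON"},
        {"text": {"content": "James", "beginOffset": 27}, "type": "PROPER"},
    ]},
    {"name": "John", "type": "PERSON", "mentions": [
        {"text": {"content": "John", "beginOffset": 5}, "type": "PROPER"},
    ]},
]

print(people_with_role(entities, "attorney"))
# [{'name': 'James', 'type': 'PERSON', 'occupation': 'attorney'}]
```

Because the match relies on the API having already attached the noun to the person, it inherits the same limitation: if attorney is not adjacent to the name, the entity will have no such COMMON mention and nothing is returned.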

I hope this helps, but if your needs are more complex, you will not be able to meet them with an out-of-the-box solution such as the Natural Language API.
