Nirali Khoda

Reputation: 388

extract name entity from unstructured data

I have highly unstructured data and I want to extract the full name from it. The data looks something like this:

txt = " 663555 murphy rd suite 106 richardson tx 7508 usa 111 it park indore 452 010 india ph 91 987 4968420 123456789 sumeetlogikviewcom  Nirali Khoda cofounder analytics pvt ltd ideata  a comprehensive data analytics platform"

text = "dicictay  8 8 8 bf infotech pvt ltd manager infotech pvt ltd  redefining technologies 91 12345 12345 zoeb fatemi "

I tried spacy and StanfordNER, but neither gives good results. They pick up names from the address, like this:

import spacy

en = spacy.load('en_core_web_md')

txt = txt.title().strip()

sents = en(txt)

people = [ee for ee in sents.ents if ee.label_ == 'PERSON']

The output is:

[663555 Murphy Rd Suite, Analytics Pvt Ltd Ideata]

The expected output is:

[Nirali Khoda]

Help would be appreciated. Thanks :)

Upvotes: 1

Views: 1203

Answers (1)

dennlinger

Reputation: 11488

Before I start, I want to mention that I used spacy.load("en_core_web_lg") for my code instead. The choice of model seems to affect the parsing results quite significantly, so switching models could be a first thing to try for your problem.
I also had no running installation of StanfordNER locally, so I used their provided web interface instead.

NER is problematic in this case, as you kind of already mentioned, because your "sentences" lack any cohesive structure. Most NER accuracy comes from context information, which is clearly lacking in your case.
This is also nicely visualized by parsing one of the sentences from your examples in the web interface mentioned above: The parsed sentence tree looks very scary (obviously), and there is not much we can take from there.

I also parsed the first sentence with SpaCy, and got the following result when looking at the recognized entities:

663555 DATE
106 Richardson PERSON
Tx GPE
7508 DATE
Usa GPE
111 CARDINAL
Park Indore GPE
452 010 CARDINAL
India GPE
91 CARDINAL
987 CARDINAL
123456789 DATE
Sumeetlogikviewcom PERSON
Nirali Khoda Cofounder Analytics Pvt Ltd Ideata ORG
Comprehensive Data Analytics Platform ORG

As we can see, the problem is two-fold here: Not only is the instance with your name in it mislabeled (ORG instead of PERSON), but it also shows that the initial split into different entities is problematic.

I am assuming that you have some way of accessing the data extraction pipeline, and are not "blindly" taking these strings from somewhere else. This matters because it lets you introduce some form of separation between different containers. Although most preprocessors have some form of boilerplate removal (which strips HTML tags and "unifies" their contents), keeping some kind of separator might do you good. I slightly altered your input to the following:

txt = " 663555 murphy rd suite 106 richardson tx 7508 usa , 111 it park indore 452 010 india ph 91 987 4968420 123456789 , sumeetlogikviewcom ,  Nirali Khoda , cofounder analytics pvt ltd , ideata  a comprehensive data analytics platform"

Then, I performed the same processing again, and - look at that - ended up with the following result:

663555 DATE
106 Richardson PERSON
Tx GPE
7508 DATE
Usa GPE
111 CARDINAL
Park Indore GPE
452 010 CARDINAL
India GPE
91 CARDINAL
987 CARDINAL
123456789 DATE
Sumeetlogikviewcom PERSON
Nirali Khoda PERSON
Cofounder Analytics Pvt Ltd ORG
Ideata   ORG

This time, the result is both correctly split up and (more) correctly classified. Obviously you are still not getting perfect results, but that is seldom the case with NER.
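As a rough sketch of the separation idea, assuming your extraction step can hand you the individual fields before they are concatenated, you could join them with an explicit delimiter instead of plain whitespace. The function name `join_fields`, the field list, and the `" , "` separator are all illustrative assumptions, not part of any library:

```python
def join_fields(fields, sep=" , "):
    """Join extracted text fields with an explicit separator.

    Whitespace inside each field is normalized, and empty fields are
    dropped so that no doubled separators end up in the result.
    """
    cleaned = [" ".join(field.split()) for field in fields]
    return sep.join(field for field in cleaned if field)


# Hypothetical fields, as they might come out of separate containers.
fields = [
    "663555 murphy rd suite 106 richardson tx 7508 usa",
    "111 it park indore 452 010 india ph 91 987 4968420 123456789",
    "sumeetlogikviewcom",
    " Nirali Khoda ",
    "cofounder analytics pvt ltd",
    "ideata  a comprehensive data analytics platform",
]

txt = join_fields(fields)
```

The resulting `txt` can then be passed to the NER model as before; whether a comma or some other delimiter works best is something you would have to try on your own data.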

If you only want to recognize names, you can also "manually parse" them with a cruder approach: let SpaCy or CoreNLP split the text into entities, and then, regardless of the actual tag, check for each entity whether it contains a token that appears in a set of common first/last names (data for the U.S., for example, can be found here). I am sure more comprehensive lists exist, and this might be a good substitute if you are literally only looking for names. Of course, this is unlikely to solve your problem perfectly either (think of Toyota, which is incidentally also a very common last name in Japanese; or something like Mr. Propper, which, to a computer, might just as well be a "person").
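A minimal sketch of that lookup, assuming the entity chunks have already been split out by the NER tool. The `KNOWN_NAMES` set here is a tiny hypothetical stand-in for a real name list, and the helper names are made up for illustration:

```python
# Tiny hypothetical stand-in for a real list of common first/last names.
KNOWN_NAMES = frozenset({"nirali", "khoda", "zoeb", "fatemi"})


def looks_like_name(chunk, known_names=KNOWN_NAMES):
    """Return True if any token of the chunk appears in the name list."""
    return any(token in known_names for token in chunk.lower().split())


def filter_name_chunks(chunks, known_names=KNOWN_NAMES):
    """Keep only the chunks that contain at least one known name token."""
    return [chunk for chunk in chunks if looks_like_name(chunk, known_names)]


# Entity texts as they might come out of the NER step, tags ignored.
chunks = [
    "106 Richardson",
    "Sumeetlogikviewcom",
    "Nirali Khoda",
    "Cofounder Analytics Pvt Ltd",
    "Ideata",
]

candidates = filter_name_chunks(chunks)
```

With this toy list only "Nirali Khoda" survives the filter. A real list would let through more false positives (a chunk like "106 Richardson" would match if "richardson" were listed as a surname), so you may want to require more than one matching token, or combine the lookup with the entity tag.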

Upvotes: 2
