vinita

Reputation: 587

Extracting fields from emails using values in a database as a training set

I've got 480 emails, and each of them contains some or all of these values:

[person, degree, working/not working, role]

For example, one of the emails looks like this:

    Hi Amy,

    I wanted to discuss about Bob. I can see that he has a degree in 
    Computer Science which he got three years ago but hes still unemployed. 
    I dont know whetehr he'll be fit for the role of junior programmer at 
    our institute.
    Will get back to you on this.

    Thanks

The corresponding database entry for this email looks like this:

    Email_123 | Bob | Computer Science | Unemployed | Junior Programmer

Even though the data hasn't been labelled, we still have a database of sorts to look up which values were extracted into the four fields from each email. My question is: how can I use this corpus of 480 emails to learn to extract these four fields using machine learning/NLP? Do I need to manually tag all 480 emails, like this:

    I wanted to discuss about <person>Bob</person>. I can see that he has a degree in 
    <degree>Computer Science</degree> which he got....

Or is there a better way? Something like this (MarI/O - Machine Learning for Video Games): https://www.youtube.com/watch?v=qv6UVOQ0F44&t=149s

Upvotes: 1

Views: 86

Answers (1)

polm23

Reputation: 15593

Assuming that each email has only one value for each field, and that the value is always reproduced verbatim from the email, you can use something like WikiReading.

[Image: WikiReading extraction example]

The problem is that WikiReading was trained on 4.7 million examples, so if you only have 480 that's nowhere near enough to train a good model.

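What I would suggest is preprocessing your dataset to automatically add tags like in your example. Something like this, in rough Python:

    entity = "Junior Programmer"      # value from the database row
    entity_type = "role"              # name of the database field
    mail = "...[text of email]..."

    # str.find returns -1 when the value is absent (str.index would raise)
    ind = mail.find(entity)
    if ind == -1:
        tagged = mail                 # no exact match, leave the email untagged
    else:
        end = ind + len(entity)
        tagged = "{front}<{tag}>{ent}</{tag}>{back}".format(
            front=mail[:ind],
            back=mail[end:],
            tag=entity_type,
            ent=entity)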

You'll need to adjust for case issues, multiple matches, and so on.
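
Those adjustments could look roughly like the sketch below. It assumes each database value appears in the email verbatim apart from letter case, and that values start and end with a letter or digit (so `\b` word boundaries work). The function name `tag_email`, the `fields` dictionary, and the field name `status` are hypothetical, chosen only for illustration. Values that are never found are reported back so you can inspect those emails by hand.

```python
import re

def tag_email(mail, fields):
    """Wrap every case-insensitive occurrence of each field value in
    <field>...</field> tags. Returns (tagged_text, missing_field_names)."""
    spans = []    # (start, end, field_name) for every match found
    missing = []  # fields whose value never appears in the email
    for name, value in fields.items():
        pattern = r"\b" + re.escape(value) + r"\b"
        matches = list(re.finditer(pattern, mail, flags=re.IGNORECASE))
        if not matches:
            missing.append(name)
        spans.extend((m.start(), m.end(), name) for m in matches)

    # Sort by position, longest first on ties, and drop overlapping spans.
    spans.sort(key=lambda s: (s[0], -(s[1] - s[0])))
    kept, last_end = [], 0
    for start, end, name in spans:
        if start >= last_end:
            kept.append((start, end, name))
            last_end = end

    # Rebuild the text, keeping the email's own spelling inside the tags.
    out, pos = [], 0
    for start, end, name in kept:
        out.append(mail[pos:start])
        out.append("<{0}>{1}</{0}>".format(name, mail[start:end]))
        pos = end
    out.append(mail[pos:])
    return "".join(out), missing

mail = ("I wanted to discuss about Bob. He has a degree in Computer Science "
        "but he is still unemployed. I am not sure he fits the role of "
        "junior programmer.")
fields = {"person": "Bob", "degree": "Computer Science",
          "status": "Unemployed", "role": "Junior Programmer"}
tagged, missing = tag_email(mail, fields)
print(tagged)
print(missing)
```

Tagging every occurrence is a design choice: if "Bob" is mentioned three times, all three become training examples. If a value is paraphrased rather than copied (say "out of work" for "Unemployed"), it ends up in `missing` and needs manual tagging or fuzzy matching.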

With tagged data you can use a conventional NER system like a CRF. Here's a tutorial using spaCy in Python.
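
Most NER toolkits want labels as character offsets or per-token tags rather than inline markup, so the tagged text usually needs one more conversion. Here is a minimal sketch using only the standard library. The function name `tags_to_spans` is hypothetical, and it assumes the tags are well-formed, not nested, and that the emails contain no other angle-bracket markup. The `(start, end, label)` triples it produces are the general shape that offset-based NER training data takes; check the tutorial for the exact format your library version expects.

```python
import re

# Matches <label>text</label>; \1 forces the closing tag to match the opening one.
TAG = re.compile(r"<(\w+)>(.*?)</\1>", flags=re.DOTALL)

def tags_to_spans(tagged):
    """Strip inline tags and return (plain_text, [(start, end, label), ...]),
    where the offsets index into plain_text."""
    plain, spans = [], []
    pos = 0      # read position in the tagged string
    length = 0   # length of the plain text built so far
    for m in TAG.finditer(tagged):
        before = tagged[pos:m.start()]
        plain.append(before)
        length += len(before)
        label, inner = m.group(1), m.group(2)
        spans.append((length, length + len(inner), label))
        plain.append(inner)
        length += len(inner)
        pos = m.end()
    plain.append(tagged[pos:])
    return "".join(plain), spans

text, spans = tags_to_spans(
    "I wanted to discuss about <person>Bob</person>. He has a degree in "
    "<degree>Computer Science</degree>.")
for start, end, label in spans:
    print(label, repr(text[start:end]))
```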

Upvotes: 1
