Extracting fields from emails using values in a database as a training set

Question

I've got 480 emails, and each of them contains some or all of these values:

[person, degree, working/not working, role]

For example, one of the emails looks like this:

    Hi Amy,

    I wanted to discuss about Bob. I can see that he has a degree in 
    Computer Science which he got three years ago but hes still unemployed. 
    I dont know whetehr he'll be fit for the role of junior programmer at 
    our institute.
    Will get back to you on this.

    Thanks

The corresponding database entry for this email looks like this:

Email_123 | Bob | Computer Science | Unemployed | Junior Programmer

Even though the data hasn't been labelled, we still have a database of sorts in which to look up the values that were extracted into the 4 fields from each email. My question is: how can I use this corpus of 480 emails to learn to extract these 4 fields using machine learning/NLP? Do I need to manually tag all 480 emails, like this:

    I wanted to discuss about Bob. I can see that he has a degree in 
    Computer Science which he got....

Or is there a better way? Something like this (MarI/O - Machine Learning for Video Games): https://www.youtube.com/watch?v=qv6UVOQ0F44&t=149s

polm23 · Accepted Answer

Assuming that each email has only one value for each field, and that the value is always reproduced verbatim from the email, you can use something like WikiReading.

The problem is that WikiReading was trained on 4.7 million examples, so if you only have 480 that's nowhere near enough to train a good model.
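Before settling on an approach, it may be worth checking how often the "reproduced verbatim" assumption actually holds in the 480 emails. The following is a minimal sketch of such a check; it assumes each database entry is available as a dict, and both the `row` layout and the function name `verbatim_coverage` are hypothetical, not something from the question.

```python
def verbatim_coverage(mail, row, ignore_case=False):
    """Return {field: bool} saying whether each database value occurs in the mail."""
    text = mail.lower() if ignore_case else mail
    found = {}
    for field, value in row.items():
        needle = value.lower() if ignore_case else value
        found[field] = needle in text
    return found


# Example email and database entry taken from the question.
mail = (
    "I wanted to discuss about Bob. I can see that he has a degree in "
    "Computer Science which he got three years ago but hes still unemployed. "
    "I dont know whetehr he'll be fit for the role of junior programmer at "
    "our institute."
)
# Hypothetical layout: one dict per database entry, keyed by field name.
row = {
    "person": "Bob",
    "degree": "Computer Science",
    "working": "Unemployed",
    "role": "Junior Programmer",
}

exact = verbatim_coverage(mail, row)
relaxed = verbatim_coverage(mail, row, ignore_case=True)
```

On the sample email, exact matching finds only the person and the degree; the other two fields match only once case is ignored, so even this single example is not strictly verbatim.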

What I would suggest is preprocessing your dataset to automatically add tags like in your example. Something like this, in pseudo-Python:

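    entity = "Junior Programmer"
    entity_type = "role"
    mail = "...[text of email]..."  # placeholder; must contain `entity` verbatim

    # str.index raises ValueError if the entity does not occur in the mail
    ind = mail.index(entity)
    # wrap the matched text in opening and closing tags named after the field
    tagged = "{front}<{tag}>{ent}</{tag}>{back}".format(
        front=mail[:ind],
        back=mail[ind + len(entity):],
        tag=entity_type,
        ent=entity)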

You'll need to adjust for case issues, multiple matches, and so on.
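One possible way to handle the case and multiple-match adjustments is sketched below. It is only an illustration under some assumptions: the database entry is a dict of field name to value, matching is a plain case-insensitive substring search, and the function name `tag_mail` is made up for this example.

```python
import re


def tag_mail(mail, row):
    """Wrap every case-insensitive occurrence of each value in <field>...</field>."""
    # Collect (start, end, field) for every match of every database value.
    spans = []
    for field, value in row.items():
        for m in re.finditer(re.escape(value), mail, flags=re.IGNORECASE):
            spans.append((m.start(), m.end(), field))

    # Drop overlapping matches, preferring the earlier and then the longer one.
    spans.sort(key=lambda s: (s[0], -(s[1] - s[0])))
    kept, last_end = [], 0
    for start, end, field in spans:
        if start >= last_end:
            kept.append((start, end, field))
            last_end = end

    # Rebuild the text left to right, keeping the email's own casing.
    out, pos = [], 0
    for start, end, field in kept:
        out.append(mail[pos:start])
        out.append("<{0}>{1}</{0}>".format(field, mail[start:end]))
        pos = end
    out.append(mail[pos:])
    return "".join(out)


mail = "Bob is still unemployed. Is Bob fit for the role of junior programmer?"
row = {"person": "Bob", "working": "Unemployed", "role": "Junior Programmer"}
tagged = tag_mail(mail, row)
```

Here both mentions of Bob are tagged and the lowercase "unemployed" and "junior programmer" are matched despite the capitalised database values. Because this is substring matching, a value like "Bob" would also match inside "Bobby", so some word-boundary handling may still be needed.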

With tagged data you can use a conventional NER system like a CRF. Here's a tutorial using spaCy in Python.
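CRF-style sequence taggers are usually trained on per-token labels rather than inline tags, commonly in the BIO scheme (B- for the first token of an entity, I- for the following ones, O for everything else). As a rough sketch of that conversion, assuming a simple regex tokenizer and a dict-shaped database entry (the name `bio_labels` and the tokenizer are illustrative choices, and the input format a particular library expects may differ):

```python
import re


def bio_labels(mail, row):
    """Return (token, label) pairs in the BIO scheme for each database value found."""
    # Character spans of every case-insensitive match, per field.
    spans = []
    for field, value in row.items():
        for m in re.finditer(re.escape(value), mail, flags=re.IGNORECASE):
            spans.append((m.start(), m.end(), field))

    pairs = []
    # Simple tokenizer: runs of word characters, or single punctuation marks.
    for tok in re.finditer(r"\w+|[^\w\s]", mail):
        label = "O"
        for start, end, field in spans:
            # A token is labelled only if it lies entirely inside a span.
            if start <= tok.start() and tok.end() <= end:
                prefix = "B" if tok.start() == start else "I"
                label = "{}-{}".format(prefix, field)
                break
        pairs.append((tok.group(), label))
    return pairs


mail = "Is Bob fit for the role of junior programmer?"
row = {"person": "Bob", "role": "Junior Programmer"}
pairs = bio_labels(mail, row)
```

For this sentence, "Bob" gets `B-person`, "junior" gets `B-role`, "programmer" gets `I-role`, and every other token gets `O`.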
