Reputation: 1049
I'm nearly a total outsider to programming, just interested in it. I work in a shipbroking company and need to match positions (which ship will be open where and when) with orders (what kind of ship will be needed where and when, and for what kind of employment). We send and receive this info (positions and orders) by email to and from our principals and co-brokers. There are thousands of such emails each day. We do the matching by reading the emails manually.
I want to build an app to do the matching for us.
One important part of this app will do the information extraction from email text.
==> My question is: how do I use Python to turn unstructured info into structured data?
Sample email of an order [annotations are in brackets and are not part of the email]:
Email Subject: 20k dwt requirement, 20-30/mar, Santos-Conti
Content:
Acct ABC [Account Name]
Abt 20,000 MT Deadweight [Size of Ship Needed]
Delivery to make Santos [Delivery Point/Range, Owners will deliver the ship to Charterers here]
Laycan 20-30/Mar [Laycan (the time spread in which delivery can be accepted)]
1 time charter with grains [What kind of Employment/Trade, Cargo]
Duration about 35 days [Duration]
Redelivery 1 safe port Continent [Redelivery Point/Range, Charterers will redeliver the ship back to Owners here.]
Broker name/email/phone...
End Email
The same email can be written in many different ways: some write it all on one line, some use l/c instead of laycan... There are also emails for positions, with the ship's name, open port, date range, the ship's deadweight and other specs.
How can I extract the info and put it into structured data, with Python? Let's say I have put all email contents into text files. Thanks.
Upvotes: 4
Views: 3735
Reputation: 1288
Below is a possible approach:
Step 1: Classify the mails into categories using the subject and/or the message body.
As you stated, one category is position mails and the other is order mails. Machine learning can be used to classify them, with a set of previous mails as the training corpus. You might consider using NLTK (Natural Language Toolkit) for Python. Here is the link on text classification using NLTK.
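As a rough illustration of Step 1, here is a minimal sketch using NLTK's Naive Bayes classifier with simple word-presence features. The four training mails, the `features` helper and the `classify_email` function are all made up for this example; a real training corpus would be a large set of your own previously labelled mails, and a different classifier or feature set may well work better.

```python
import re
import nltk

def features(text):
    """Turn a mail into word-presence features, keeping terms like 'l/c' whole."""
    tokens = re.findall(r"[a-z]+(?:/[a-z]+)?", text.lower())
    return {f"has({tok})": True for tok in tokens}

# Hypothetical, hand-labelled training mails (invented for illustration only).
TRAINING = [
    ("20k dwt requirement 20-30/mar santos-conti acct abc delivery to make "
     "santos laycan 20-30/mar redelivery continent", "order"),
    ("need 30,000 dwt vessel l/c 1-10 apr delivery rotterdam redelivery med",
     "order"),
    ("mv example open santos 25 mar dwt 28,000 built 2010", "position"),
    ("position list mv sample open singapore 1-5 apr 32k dwt", "position"),
]

# Train on (feature dict, label) pairs.
classifier = nltk.NaiveBayesClassifier.train(
    [(features(text), label) for text, label in TRAINING]
)

def classify_email(subject, body=""):
    """Return the predicted category for a mail's subject plus body."""
    return classifier.classify(features(subject + " " + body))

print(classify_email("need 25k dwt",
                     "acct xyz delivery santos laycan 5-10 apr redelivery continent"))
```

With only four training mails this is a toy; the point is the shape of the pipeline (text to features, features to label), not the accuracy.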
Step 2: Once an email is identified as an order mail, process it to fetch the details (account name, size, time spread, etc.). As you mentioned, the challenge here is that there is no fixed format for this data. To solve this problem, you might consider preparing an exhaustive list of synonyms for each label (for account, the list could be ['acct', 'a/c', 'account', 'acnt']). This should be done once, by going through a fixed volume of previous mails.
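One way to hold such synonym lists is a dictionary keyed by the canonical label, inverted once into a lookup table. In this sketch only the account list and "l/c" come from the posts above; the other synonyms and the names `LABEL_SYNONYMS` and `canonical_label` are assumptions for illustration.

```python
# Canonical label -> spellings seen in mails.
# Only the "account" list and "l/c" come from the discussion; the rest are guesses.
LABEL_SYNONYMS = {
    "account": ["acct", "a/c", "account", "acnt"],
    "laycan": ["laycan", "l/c"],
    "delivery": ["delivery", "dely"],
    "redelivery": ["redelivery", "redely"],
    "duration": ["duration"],
}

# Inverted table: spelling -> canonical label, for constant-time lookup.
SYNONYM_TO_LABEL = {
    synonym: label
    for label, synonyms in LABEL_SYNONYMS.items()
    for synonym in synonyms
}

def canonical_label(token):
    """Map a word from a mail to its canonical label, or None if unknown."""
    key = token.strip().lower().rstrip(":.")
    return SYNONYM_TO_LABEL.get(key)

print(canonical_label("Acct"))   # account
print(canonical_label("L/C:"))   # laycan
print(canonical_label("accnt"))  # None
```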
To make the solution more effective, you could consider adding an option for active learning (i.e., prompt the user whenever a mail contains a label that is not found in any list). E.g., if "accnt" is used in a mail, it won't be resolved, so the user should be asked which category it falls into.
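The prompting idea could look roughly like the sketch below: when a spelling is not in any list, ask the user once and remember the answer. The function name `resolve_label` and the `ask` parameter are hypothetical; `ask` defaults to the built-in `input` so the function can be driven from a console, and can be replaced by any callable in a GUI or a test.

```python
def resolve_label(token, synonyms, ask=input):
    """Return the canonical label for token, asking the user if it is unknown.

    synonyms maps each canonical label to a list of known spellings and is
    updated in place when the user assigns a new spelling to a label.
    """
    key = token.strip().lower()
    for label, spellings in synonyms.items():
        if key in spellings:
            return label

    # Unknown spelling: ask which category it belongs to.
    prompt = (f"Unknown label {token!r}. Which category is it? "
              f"{sorted(synonyms)} (blank to skip): ")
    answer = ask(prompt).strip().lower()
    if answer in synonyms:
        synonyms[answer].append(key)  # remember it for the next mail
        return answer
    return None

# Example with a canned answer instead of real console input.
known = {"account": ["acct", "a/c", "account", "acnt"], "laycan": ["laycan", "l/c"]}
print(resolve_label("accnt", known, ask=lambda prompt: "account"))  # account
print(known["account"])
```

In a real app the updated lists would need to be saved somewhere (a file or database) so the answers survive between runs.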
Once a label is identified, you can use basic string operations to parse the email and fetch the relevant data in a structured format.
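For the sample order in the question, the parsing step might be sketched as follows, assuming one field per line with the label at the start of the line. The synonym lists beyond the account one, the size regex and the `parse_order` name are assumptions; one-line mails and position mails would need additional rules.

```python
import re

# Only the "account" list and "l/c" come from the discussion; the rest are guesses.
LABEL_SYNONYMS = {
    "account": ["acct", "a/c", "account", "acnt"],
    "delivery": ["delivery", "dely"],
    "laycan": ["laycan", "l/c"],
    "duration": ["duration"],
    "redelivery": ["redelivery", "redely"],
}

def build_pattern(spellings):
    """Regex matching any spelling at the start of a line, capturing the rest."""
    alternatives = "|".join(
        re.escape(s) for s in sorted(spellings, key=len, reverse=True)
    )
    return re.compile(r"^\s*(?:" + alternatives + r")\b[\s:.\-]*(.*)$",
                      re.IGNORECASE)

PATTERNS = {label: build_pattern(sp) for label, sp in LABEL_SYNONYMS.items()}

# The size line has no leading label, so look for a number followed by MT/DWT.
SIZE_RE = re.compile(r"(\d[\d,]*)\s*(?:mt|dwt)\b", re.IGNORECASE)

def parse_order(text):
    """Extract labelled fields from an order mail body into a dict."""
    record = {}
    for line in text.splitlines():
        for label, pattern in PATTERNS.items():
            match = pattern.match(line)
            if match:
                record.setdefault(label, match.group(1).strip())
                break
    size = SIZE_RE.search(text)
    if size:
        record["size_mt"] = int(size.group(1).replace(",", ""))
    return record

SAMPLE_EMAIL = """Acct ABC
Abt 20,000 MT Deadweight
Delivery to make Santos
Laycan 20-30/Mar
1 time charter with grains
Duration about 35 days
Redelivery 1 safe port Continent"""

print(parse_order(SAMPLE_EMAIL))
```

The extracted values are still raw strings (for example "20-30/Mar"); turning them into dates or port names would be a further step.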
You can refer to this discussion for a better understanding.
Upvotes: 1