Reputation: 718
I need to extract a person's email address. I trained the NER model in spaCy with a few examples, but had no luck; it would need thousands of examples to give satisfying results. So I have started looking at the token Matcher to fetch the email address. Has anyone worked on this before? Is there a better approach?
Upvotes: 3
Views: 5816
Reputation: 1096
I've stumbled upon Alexander Crosson's Medium post on this topic: https://medium.com/@acrosson/extracting-names-emails-and-phone-numbers-5d576354baa
This nice regex-based approach works for me, as long as the phone number is 10 digits (no country code):
    import re

    def get_phone_numbers(string):
        r = re.compile(r'(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4})')
        phone_numbers = r.findall(string)
        return [re.sub(r'\D', '', num) for num in phone_numbers]

    def get_email_addresses(string):
        r = re.compile(r'[\w\.-]+@[\w\.-]+')
        return r.findall(string)
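One thing to watch with the email pattern above: because the character class after the `@` includes `.` and `-`, a sentence-final period is captured as part of the address. The sketch below shows that behaviour and one possible workaround. The name `get_email_addresses_clean` and the sample sentence are made up for illustration and are not from the linked post.

```python
import re

# Same loose pattern as in the answer above.
EMAIL_RE = re.compile(r'[\w\.-]+@[\w\.-]+')

def get_email_addresses_clean(text):
    """Hypothetical helper: find email-like strings, then strip trailing dots or hyphens."""
    return [match.rstrip('.-') for match in EMAIL_RE.findall(text)]

sample = "Write to peter@example.com. Or call 555-123-4567."
print(EMAIL_RE.findall(sample))           # raw match keeps the sentence-final dot
print(get_email_addresses_clean(sample))  # trailing punctuation removed
```

Stripping trailing punctuation is only a heuristic; it does not validate the address in any stricter sense.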
Upvotes: 1
Reputation: 718
I have used syntactic dependencies to write a few types of rules that identify relations. See the code below:
    for email in doc:
        print(email.text, email.dep_, email.ent_type_, email.pos_, email.head)
        if email.like_email:
            if email.dep_ in ("attr", "dobj", "punct"):
                subject = [w for w in email.head.lefts if w.dep_ == "nsubj" or w.dep_ == "nsubjpass"]
                if subject:
                    subject = subject[0]
                    per = extract_person_names(subject.text)
                    if per.text != "null":
                        relations.append((per, email))
                    else:
                        print("no entity")
            elif email.dep_ == "pobj" and email.head.dep_ == "prep":
                if doc[email.head.i-1].ent_type_ == 'PERSON':
                    relations.append((doc[email.head.i-1], email))
Upvotes: 0
Reputation: 7105
Email addresses should be straightforward to extract: you can write a token pattern or even look at a token's like_email attribute, which will return True if it resembles an email address.
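A minimal sketch of both options, assuming spaCy v3 is installed. A blank English pipeline should be enough here, since `like_email` is a lexical attribute and does not depend on a trained model. The sample sentence and the match key `"EMAIL"` are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

# Blank pipeline: tokenizer only, no trained components.
nlp = spacy.blank("en")

# Token pattern that matches any single token resembling an email address.
matcher = Matcher(nlp.vocab)
matcher.add("EMAIL", [[{"LIKE_EMAIL": True}]])

doc = nlp("You can reach Peter at peter@example.com for details")

# Option 1: check the attribute on every token.
via_attribute = [token.text for token in doc if token.like_email]

# Option 2: let the Matcher find the tokens.
via_matcher = [doc[start:end].text for _, start, end in matcher(doc)]

print(via_attribute, via_matcher)
```

The Matcher version becomes more useful once the pattern grows beyond one token, for example when the email has to follow a particular word.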
To find out how the email address token is related to the rest of the sentence, one approach is to look at the syntax and write your own extraction logic using the syntactic dependencies (token.dep_), part-of-speech tags (token.pos_) or subtree (token.subtree).
Here's an example that shows the idea:
The email address is attached to the verb "is", which is attached to the subject of the sentence "email address". The proper noun "Peter" is attached to the subject with the label poss (possessive). So the owner of the email address is Peter. If your sentences look like this, you can write a function that extracts this information based on the tokens and their relationships.
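As a rough sketch of such a function, assuming spaCy v3: the parse below is annotated by hand to mirror the construction just described (a trained parser may label a real sentence differently), and `find_email_owners` is a hypothetical helper name, not part of spaCy.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")  # supplies the vocab, including like_email

# Hand-annotated parse, standing in for the output of a trained parser.
words = ["Peter", "'s", "email", "address", "is", "peter@example.com"]
heads = [3, 0, 3, 4, 4, 4]  # absolute index of each token's head
deps = ["poss", "case", "compound", "nsubj", "ROOT", "attr"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

def find_email_owners(doc):
    """Hypothetical helper: pair each email token with the possessor of the subject."""
    pairs = []
    for token in doc:
        # Only consider email-like tokens attached to a verb as an attribute.
        if not (token.like_email and token.dep_ == "attr"):
            continue
        # Walk from the verb to its subject, then to the subject's possessor.
        for subject in token.head.lefts:
            if subject.dep_ != "nsubj":
                continue
            for child in subject.lefts:
                if child.dep_ == "poss":
                    pairs.append((child.text, token.text))
    return pairs

print(find_email_owners(doc))
```

This handles only the one construction; other sentence shapes (for example "email Peter at ...") would need their own rules.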
Of course, it's not always that easy – your texts might look very different and you might have to write logic for various different constructions. For more info and examples, see the documentation on combining models and rules.
Upvotes: 4
Reputation: 86
Try haptik-ner. Although its use is specific to chatbots, you may be able to use the code to detect emails as well.
Upvotes: 0