Reputation: 417
I'm working on a project where I have to extract honorific titles (Mr, Mrs, St, etc.) from a novel. The desired output for the text I'm working with is:
['Col', 'Dr', 'Mr', 'Mrs', 'Otto', 'Rev', 'St']
However, with the code I wrote, the output is this:
{'Tom.', 'Mrs.', 'Otto.', 'Mary.', 'Bots.', 'Come.', 'No.', 'Col.', 'Cain.', 'Dr.', 'Gang.', 'Ike.', 'Kean.', 'St.', 'Hank.', 'Him.', 'Finn.', 'Ann.', 'Jane.', 'Alas.', 'Huck.', 'Sis.', 'Buck.', 'Jim.', 'Sid.', 'Mr.', 'Bill.', 'Rev.', 'Yes.'}
This is the code I have so far:
import re

def get_titles(text):
    # A capital letter, 1-3 lowercase letters, then a period, e.g. 'Mr.'
    pattern = re.compile(r'[A-Z][a-z]{1,3}\.')
    title_tokens = set(re.findall(pattern, text))
    # The same shape, but with no period required
    pattern2 = re.compile(r'[A-Z][a-z]{1,3}')
    pseudo_titles = set(re.findall(pattern2, text))
    pseudo_titles = [word.strip() for word in pseudo_titles]
    pseudo_titles = [word.replace('\n', '') for word in pseudo_titles]
    difference = title_tokens.difference(pseudo_titles)
    return difference

test = get_titles(text)
print(test)
As you can see, the output includes extra words that are merely followed by a period. I believe the issue stems from the regular expressions, but I'm not sure. Any advice or tips are appreciated.
The text can be found here: http://www.gutenberg.org/files/76/76-0.txt
Upvotes: 1
Views: 395
Reputation: 51063
Essentially, you are asking for an algorithm that can tell the difference between a title and a one-word sentence. These are lexically indistinguishable; for example, consider the following two strings:
In the first sentence, "Yes." is a one-word sentence, and in the second, "Mr." is a title. As humans, we know this only because we understand the meanings of the tokens "Yes" and "Mr", so an algorithm that can distinguish these cases needs some information about the meanings of the tokens it is parsing. It cannot work purely lexically, as a regex does. This means you must either write a whitelist of allowed titles or a blacklist of words that are not titles; otherwise, the problem is much more difficult.
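To make this concrete, here is a minimal sketch. The two strings s1 and s2 are made up for illustration (they are not from the novel), and KNOWN_TITLES is a hypothetical whitelist that you would have to fill in yourself. It shows that the regex sees the same shape in both strings, and that a whitelist is what actually separates the title from the one-word sentence.

```python
import re

# Hypothetical example strings, invented for illustration only.
s1 = "Did you see him? Yes. He was there."
s2 = "Did you see him? Mr. Smith was there."

# Same shape as the pattern in the question, anchored at a word boundary.
pattern = re.compile(r'\b[A-Z][a-z]{1,3}\.')

# Lexically, both strings contain one "short capitalized word + period".
print(pattern.findall(s1))
print(pattern.findall(s2))

# Assumed whitelist; extend it with whatever titles your text uses.
KNOWN_TITLES = {'Col', 'Dr', 'Mr', 'Mrs', 'Rev', 'St'}

def get_titles(text, allowed=KNOWN_TITLES):
    """Return the whitelisted titles that occur in text, without periods."""
    candidates = {match.rstrip('.') for match in pattern.findall(text)}
    return sorted(candidates & allowed)

print(get_titles(s1))
print(get_titles(s2))
```

With a whitelist the regex only proposes candidates; the knowledge of what counts as a title lives entirely in the set.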
Alternatively, if your project doesn't involve parsing titles from very many novels, you could just trim down the results by hand, using your human knowledge that "Tom" and "Yes" aren't titles. It shouldn't be that much work.
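If you go the manual route, one way to record your decisions is a small rejection set. This is only a sketch: NOT_TITLES and trim_candidates are hypothetical names, and the words in the set are a handful of the non-titles from your output, picked by eye.

```python
# Hypothetical blacklist: words you have decided by hand are not titles.
NOT_TITLES = {'Tom', 'Yes', 'No', 'Huck', 'Jim'}

def trim_candidates(candidates, rejected=NOT_TITLES):
    """Strip trailing periods and drop the hand-rejected words."""
    cleaned = {word.rstrip('.') for word in candidates}
    return sorted(cleaned - rejected)

# A small subset of the output shown in the question.
raw = {'Tom.', 'Mrs.', 'Yes.', 'Dr.', 'Huck.', 'Jim.', 'No.', 'Mr.'}
print(trim_candidates(raw))
```

You would keep adding to NOT_TITLES until only titles remain in the result.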
Upvotes: 2