Extracting titles from a text

Question

I'm working on a project where I have to extract honorific titles (Mr, Mrs, St, etc.) from a novel. For the text I'm working with, the desired output is:

['Col', 'Dr', 'Mr', 'Mrs', 'Otto', 'Rev', 'St']

However, with the code I wrote, the output is this:

{'Tom.', 'Mrs.', 'Otto.', 'Mary.', 'Bots.', 'Come.', 'No.', 'Col.', 'Cain.', 'Dr.', 'Gang.', 'Ike.', 'Kean.', 'St.', 'Hank.', 'Him.', 'Finn.', 'Ann.', 'Jane.', 'Alas.', 'Huck.', 'Sis.', 'Buck.', 'Jim.', 'Sid.', 'Mr.', 'Bill.', 'Rev.', 'Yes.'}

This is the code I have so far:

import re

def get_titles(text):
  # Capital letter plus 1-3 lowercase letters, followed by a period (e.g. 'Mr.')
  pattern = re.compile(r'[A-Z][a-z]{1,3}\.')
  title_tokens = set(re.findall(pattern, text))
  # Capital letter plus 1-3 lowercase letters, with or without a period after
  pattern2 = re.compile(r'[A-Z][a-z]{1,3}')
  pseudo_titles = set(re.findall(pattern2, text))

  pseudo_titles = [word.strip() for word in pseudo_titles]
  pseudo_titles = [word.replace('\n', '') for word in pseudo_titles]

  difference = title_tokens.difference(pseudo_titles)
  return difference

test = get_titles(text)
print(test)

As you can see, the output includes extra words that merely end a sentence (for example 'Tom.' and 'Yes.'), and every entry still carries its trailing period. I believe the issue stems from the regular expressions, but I'm not sure. Any advice or tips are appreciated.

The text can be found here: http://www.gutenberg.org/files/76/76-0.txt
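
One possible direction, offered as a sketch rather than a definitive fix: in the code above, the first set keeps the trailing period while the second does not, so the set difference never removes anything, and the second pattern also matches inside longer words. The version below assumes that a title is a short capitalized word that never appears in the text without a trailing period. The `sample` string is made up for illustration and is not taken from the novel, so results on the full text may still need manual filtering.

```python
import re

def get_titles(text):
    # Short capitalized words followed by a period; the group drops the period.
    with_period = set(re.findall(r'\b([A-Z][a-z]{1,3})\.', text))
    # The same kind of whole word when it is NOT followed by a period.
    without_period = set(re.findall(r'\b[A-Z][a-z]{1,3}\b(?!\.)', text))
    # Assumption: a title only ever occurs with a period after it.
    return sorted(with_period - without_period)

# Hypothetical sample text, used only to illustrate the idea.
sample = (
    "Mr. Finn met Dr. Robinson and Mrs. Phelps near St. Louis. "
    "Tom said so. Then Tom and Jim ran. I saw Jim."
)
print(get_titles(sample))  # ['Dr', 'Mr', 'Mrs', 'St']
```

Here 'Jim' is dropped because it also occurs without a period, while 'Mr' and 'St' survive because they never do. A name that only ever appears at the end of a sentence would still slip through under this assumption.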
