user1895076

Reputation: 759

Fast Named Entity Removal with NLTK

I wrote a couple of user-defined functions in Python to remove named entities (using NLTK) from a list of text sentences/paragraphs. The problem is that my method is very slow, especially for large amounts of data. Does anyone have a suggestion for how to optimize this to make it run faster?

import nltk
import string

# Reverse tokenization: rejoin tokens, adding a space before each token
# unless it is punctuation or a contraction (starts with an apostrophe)
def untokenize(tokens):
    return "".join(
        " " + token if not token.startswith("'") and token not in string.punctuation else token
        for token in tokens
    ).strip()

# Remove named entities from a single text
def ne_removal(text):
    tokens = nltk.word_tokenize(text)
    chunked = nltk.ne_chunk(nltk.pos_tag(tokens))
    # Named-entity chunks are nltk.Tree subtrees; keep only the plain (word, tag) leaves
    tokens = [leaf[0] for leaf in chunked if type(leaf) != nltk.Tree]
    return untokenize(tokens)

To use the code, I typically have a text list and call the ne_removal function through a list comprehension. Example below:

text_list = ["Bob Smith went to the store.", "Jane Doe is my friend."]
named_entities_removed = [ne_removal(text) for text in text_list]
print(named_entities_removed)
## OUT: ['went to the store.', 'is my friend.']

UPDATE: I tried switching to a batch version with this code, but it's only slightly faster. Will keep exploring. Thanks for the input so far.

# Pull the non-entity tokens out of a chunked sentence tree
def extract_nonentities(tree):
    tokens = [leaf[0] for leaf in tree if type(leaf) != nltk.Tree]
    return untokenize(tokens)

# Batch version: tokenize, tag, and chunk the whole list at once
def fast_ne_removal(text_list):
    token_list = [nltk.word_tokenize(text) for text in text_list]
    tagged = nltk.pos_tag_sents(token_list)
    chunked = nltk.ne_chunk_sents(tagged)
    non_entities = []
    for tree in chunked:
        non_entities.append(extract_nonentities(tree))
    return non_entities
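
For reference, a quick sanity check with the same example list as above; the batch version should produce the same output as the per-sentence version:

text_list = ["Bob Smith went to the store.", "Jane Doe is my friend."]
print(fast_ne_removal(text_list))
## Expected: ['went to the store.', 'is my friend.']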

Upvotes: 2

Views: 3052

Answers (1)

alexis

Reputation: 50220

Every time you call ne_chunk(), it needs to initialize a chunker object and load the statistical model for chunking from disk. Ditto for pos_tag(). So instead of calling them on one sentence at a time, call their batch versions on the complete list of texts:

all_data = [ nltk.word_tokenize(sent) for sent in list_of_all_sents ]
tagged = nltk.pos_tag_sents(all_data)
chunked = nltk.ne_chunk_sents(tagged)

This should give you a considerable speed-up. If that's still too slow for your needs, try profiling your code and consider whether you need to switch to more high-powered tools, like @Lenz suggested.
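
As a minimal sketch of the profiling step, assuming the fast_ne_removal and text_list from the question and using only the standard-library cProfile and pstats modules:

import cProfile
import pstats

# Profile one run of the batch pipeline to see where the time actually goes
profiler = cProfile.Profile()
profiler.enable()
fast_ne_removal(text_list)
profiler.disable()

# Print the 10 most expensive calls by cumulative time
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)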

Upvotes: 2
