user1895076

Reputation: 759

Fast Named Entity Removal with NLTK

I wrote a couple of user-defined functions in Python to remove named entities (using NLTK) from a list of text sentences/paragraphs. The problem is that my method is very slow, especially for large amounts of data. Does anyone have a suggestion for how to optimize this to make it run faster?

import nltk
import string

# Reverse tokenization: rejoin tokens, adding a space before each token
# unless it is punctuation or a contraction (starts with an apostrophe)
def untokenize(tokens):
    return "".join(
        " " + token if not token.startswith("'") and token not in string.punctuation else token
        for token in tokens
    ).strip()

# Remove named entities from a single text
def ne_removal(text):
    tokens = nltk.word_tokenize(text)
    chunked = nltk.ne_chunk(nltk.pos_tag(tokens))
    # Named-entity chunks are nltk.Tree subtrees; keep only the plain (word, tag) leaves
    tokens = [leaf[0] for leaf in chunked if type(leaf) != nltk.Tree]
    return untokenize(tokens)

To use the code, I typically have a text list and call the ne_removal function through a list comprehension. Example below:

text_list = ["Bob Smith went to the store.", "Jane Doe is my friend."]
named_entities_removed = [ne_removal(text) for text in text_list]
print(named_entities_removed)
## OUT: ['went to the store.', 'is my friend.']

UPDATE: I tried switching to a batch version with this code, but it's only slightly faster. Will keep exploring. Thanks for the input so far.

# Pull the non-entity tokens out of a chunked sentence tree
def extract_nonentities(tree):
    tokens = [leaf[0] for leaf in tree if type(leaf) != nltk.Tree]
    return untokenize(tokens)

# Batch version: tokenize, tag, and chunk the whole list at once
def fast_ne_removal(text_list):
    token_list = [nltk.word_tokenize(text) for text in text_list]
    tagged = nltk.pos_tag_sents(token_list)
    chunked = nltk.ne_chunk_sents(tagged)
    non_entities = []
    for tree in chunked:
        non_entities.append(extract_nonentities(tree))
    return non_entities
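
For reference, a quick sanity check with the same example list as above; the batch version should produce the same output as the per-sentence version:

text_list = ["Bob Smith went to the store.", "Jane Doe is my friend."]
print(fast_ne_removal(text_list))
## Expected: ['went to the store.', 'is my friend.']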

Upvotes: 2

Views: 3052

Answers (1)

alexis

Reputation: 50220

Every time you call ne_chunk(), it needs to initialize a chunker object and load the statistical model for chunking from disk. Ditto for pos_tag(). So instead of calling them on one sentence at a time, call their batch versions on the complete list of texts:

all_data = [ nltk.word_tokenize(sent) for sent in list_of_all_sents ]
tagged = nltk.pos_tag_sents(all_data)
chunked = nltk.ne_chunk_sents(tagged)

This should give you a considerable speed-up. If that's still too slow for your needs, try profiling your code and consider whether you need to switch to more high-powered tools, like @Lenz suggested.
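
As a minimal sketch of the profiling step, assuming the fast_ne_removal and text_list from the question and using only the standard-library cProfile and pstats modules:

import cProfile
import pstats

# Profile one run of the batch pipeline to see where the time actually goes
profiler = cProfile.Profile()
profiler.enable()
fast_ne_removal(text_list)
profiler.disable()

# Print the 10 most expensive calls by cumulative time
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)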

Upvotes: 2
