Hari Prasad

Reputation: 1901

How to speed up Stanford NLP in Python?

from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

# english.all.3class.distsim.crf.ser.gz
st = StanfordNERTagger('/media/sf_codebase/modules/stanford-ner-2018-10-16/classifiers/english.all.3class.distsim.crf.ser.gz',
                       '/media/sf_codebase/modules/stanford-ner-2018-10-16/stanford-ner.jar',
                       encoding='utf-8')

After the tagger above is initialized, the following code takes about 10 seconds to tag even a short text, as shown below. How can I speed it up?

%%time
text = "My name is John Doe"
tokenized_text = word_tokenize(text)
classified_text = st.tag(tokenized_text)
print(classified_text)

Output

[('My', 'O'), ('name', 'O'), ('is', 'O'), ('John', 'PERSON'), ('Doe', 'PERSON')]
CPU times: user 4 ms, sys: 20 ms, total: 24 ms
Wall time: 10.9 s

Upvotes: 0

Views: 927

Answers (3)

J4FFLE

Reputation: 182

After trying several options, I like Stanza. It is developed by Stanford, is very simple to use, I didn't have to figure out how to start the server properly on my own, and it dramatically improved the speed of my program. It supports 18 different named-entity types.

I found Stanza by following the link provided in Christopher Manning's answer.

To install: pip install stanza

then in Python:

import stanza
stanza.download('en') # download English model
nlp = stanza.Pipeline('en') # initialize English neural pipeline
doc = nlp("My name is John Doe.") # run annotation over a sentence or multiple sentences

If you only want a specific tool (e.g. NER), you can restrict the pipeline with the processors argument: nlp = stanza.Pipeline('en', processors='tokenize,ner')

For an output similar to that produced by the OP:

classified_text = [(token.text, token.ner) for sentence in doc.sentences for token in sentence.tokens]
print(classified_text)
[('My', 'O'), ('name', 'O'), ('is', 'O'), ('John', 'B-PERSON'), ('Doe', 'E-PERSON')]

But to produce a list of only those words that are recognizable entities:

classified_text = [(ent.text,ent.type) for ent in doc.ents]
[('John Doe', 'PERSON')]

It produces a couple of features that I really like:

  1. Instead of each word being classified as a separate PERSON entity, it combines John Doe into one 'PERSON' object.
  2. If you do want each separate word, you can extract those, and it identifies which part of the entity each word is ('B' for the first word, 'I' for intermediate words, and 'E' for the last word).
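The merging behaviour described in point 1 can be sketched in plain Python (a hypothetical helper, not part of Stanza's API), turning per-token BIOES-style tags back into whole entities. Note Stanza's scheme also uses an 'S' prefix for single-token entities:

```python
def merge_bioes(tagged_tokens):
    """Merge (token, BIOES-tag) pairs into (entity_text, entity_type) pairs."""
    entities, current = [], []
    for token, tag in tagged_tokens:
        if tag == 'O':
            continue  # not part of any entity
        prefix, etype = tag.split('-', 1)
        if prefix in ('B', 'S'):
            current = [token]        # start a new entity
        else:                        # 'I' or 'E': continue the current one
            current.append(token)
        if prefix in ('E', 'S'):     # entity is complete
            entities.append((' '.join(current), etype))
            current = []
    return entities

tagged = [('My', 'O'), ('name', 'O'), ('is', 'O'),
          ('John', 'B-PERSON'), ('Doe', 'E-PERSON')]
print(merge_bioes(tagged))  # [('John Doe', 'PERSON')]
```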

Upvotes: 2

Christopher Manning

Reputation: 9450

Another solution within NLTK is not to use the old nltk.tag.StanfordNERTagger but instead the newer nltk.parse.CoreNLPParser. See, e.g., https://github.com/nltk/nltk/wiki/Stanford-CoreNLP-API-in-NLTK .

More generally the secret to good performance is indeed to use a server on the Java side, which you can repeatedly call without having to start new subprocesses for each sentence processed. You can either use the NERServer if you just need NER or the StanfordCoreNLPServer for all CoreNLP functionality. There are a number of Python interfaces to it, see: https://stanfordnlp.github.io/CoreNLP/other-languages.html#python
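A minimal sketch of the CoreNLPParser route, assuming a StanfordCoreNLPServer is already running locally on port 9000 (the default when started with java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000); the try/except only makes the snippet degrade gracefully when no server is up:

```python
tokens = "My name is John Doe".split()
try:
    # Requires nltk and a StanfordCoreNLPServer listening on port 9000
    from nltk.parse import CoreNLPParser
    ner_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='ner')
    print(list(ner_tagger.tag(tokens)))
except Exception as exc:
    print("CoreNLP server (or nltk) unavailable:", exc)
```

Because the server stays resident, repeated calls to ner_tagger.tag avoid the per-sentence JVM startup cost that makes StanfordNERTagger so slow.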

Upvotes: 4

Hari Prasad

Reputation: 1901

Found the answer.

Start the Stanford NER server in the background, from the folder where Stanford NER is unzipped:

java -Djava.ext.dirs=./lib -cp stanford-ner.jar edu.stanford.nlp.ie.NERServer -port 9199 -loadClassifier ./classifiers/english.all.3class.distsim.crf.ser.gz

Then create a tagger in Python that talks to that server, using the sner library.

from sner import Ner
tagger = Ner(host='localhost',port=9199)

Then run the tagger.

%%time
text = "My name is John Doe"
classified_text = tagger.get_entities(text)
print(classified_text)

Output:

[('My', 'O'), ('name', 'O'), ('is', 'O'), ('John', 'PERSON'), ('Doe', 'PERSON')]
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 18.2 ms

Almost 600 times better performance in terms of timing (10.9 s down to 18.2 ms)! Wow!
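As a sanity check, the speedup can be computed directly from the two wall times reported above:

```python
before_s = 10.9   # wall time per call, spawning a new JVM each time (seconds)
after_ms = 18.2   # wall time per call against the running NERServer (milliseconds)
speedup = before_s * 1000 / after_ms
print(round(speedup))  # 599
```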

Upvotes: 3
