Reputation: 1901
import numpy as np
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize
#english.all.3class.distsim.crf.ser.gz
st = StanfordNERTagger('/media/sf_codebase/modules/stanford-ner-2018-10-16/classifiers/english.all.3class.distsim.crf.ser.gz',
'/media/sf_codebase/modules/stanford-ner-2018-10-16/stanford-ner.jar',
encoding='utf-8')
After initializing above code Stanford NLP following code takes 10 second to tag the text as shown below. How to speed up?
%%time
text="My name is John Doe"
tokenized_text = word_tokenize(text)
classified_text = st.tag(tokenized_text)
print (classified_text)
Output
[('My', 'O'), ('name', 'O'), ('is', 'O'), ('John', 'PERSON'), ('Doe', 'PERSON')]
CPU times: user 4 ms, sys: 20 ms, total: 24 ms
Wall time: 10.9 s
Upvotes: 0
Views: 927
Reputation: 182
After attempting several options, I like Stanza. It is developed by Stanford, is very simple to implement, I didn't have to figure out how to start the server properly on my own, and it dramatically improved the speed of my program. It implements the 18 different object classifications.
I found Stanza by following the link provided in Christopher Manning's answer.
To download:
pip install stanza
then in Python:
import stanza
stanza.download('en') # download English model
nlp = stanza.Pipeline('en') # initialize English neural pipeline
doc = nlp("My name is John Doe.") # run annotation over a sentence or multiple sentences
If you only want a specific tool (NER), you can specify with processors
as:
nlp = stanza.Pipeline('en',processors='tokenize,ner')
For an output similar to that produced by the OP:
classified_text = [(token.text,token.ner) for i, sentence in enumerate(doc.sentences) for token in sentence.tokens]
print(classified_text)
[('My', 'O'), ('name', 'O'), ('is', 'O'), ('John', 'B-PERSON'), ('Doe', 'E-PERSON')]
But to produce a list of only those words that are recognizable entities:
classified_text = [(ent.text,ent.type) for ent in doc.ents]
[('John Doe', 'PERSON')]
It produces a couple of features that I really like:
Upvotes: 2
Reputation: 9450
Another solution within NLTK is to not use the old nltk.tag.StanfordNERTagger
but instead to use the newer nltk.parse.CoreNLPParser
. See, e.g., https://github.com/nltk/nltk/wiki/Stanford-CoreNLP-API-in-NLTK .
More generally the secret to good performance is indeed to use a server on the Java side, which you can repeatedly call without having to start new subprocesses for each sentence processed. You can either use the NERServer
if you just need NER or the StanfordCoreNLPServer
for all CoreNLP functionality. There are a number of Python interfaces to it, see: https://stanfordnlp.github.io/CoreNLP/other-languages.html#python
Upvotes: 4
Reputation: 1901
Found the answer.
Initiate the Stanford NLP Server in background in the folder where Stanford NLP is unzipped.
java -Djava.ext.dirs=./lib -cp stanford-ner.jar edu.stanford.nlp.ie.NERServer -port 9199 -loadClassifier ./classifiers/english.all.3class.distsim.crf.ser.gz
Then initiate Stanford NLP Server tagger in Python using sner library.
from sner import Ner
tagger = Ner(host='localhost',port=9199)
Then run the tagger.
%%time
classified_text=tagger.get_entities(text)
print (classified_text)
Output:
[('My', 'O'), ('name', 'O'), ('is', 'O'), ('John', 'PERSON'), ('Doe', 'PERSON')]
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 18.2 ms
Almost 300 times better performance in terms of timing! Wow!
Upvotes: 3