Reputation: 41
When I run the code below:
from nltk import word_tokenize, pos_tag, ne_chunk
sentence = "Antacids is given to Jhon,Sodium Bicarbonate is given to Carl,Folic Acid to Jeery all works at Google "
print(ne_chunk(pos_tag(word_tokenize(sentence))))
I get this output:
(S
(GPE Antacids/NNP)
is/VBZ
given/VBN
to/TO
(PERSON Jhon/NNP)
,/,
(PERSON Sodium/NNP Bicarbonate/NNP)
is/VBZ
given/VBN
to/TO
(GPE Carl/NNP)
,/,
(PERSON Folic/NNP Acid/NNP)
to/TO
(GPE Jeery/NNP)
all/DT
works/NNS
at/IN
(ORGANIZATION Google/NNP))
I want to assign the medicines (Antacids, Sodium Bicarbonate, Folic Acid) to the same category.
Which library can I use for this purpose?
Upvotes: 1
Views: 1205
Reputation: 95
Do you want to keep the misspellings in your text? For example, Jhon instead of John, Jeery instead of Jerry, and capitalized common nouns (Sodium Bicarbonate would normally be written in lower case, as sodium bicarbonate)?
The NER (Named Entity Recognition) models bundled with Python libraries are trained on clean text, so with misspelled input it will be hard to get every entity right with a generic NER model.
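If you already know which names to expect, one way to clean the text first is fuzzy matching against that list. This is only a minimal sketch using the standard library's difflib; the KNOWN_NAMES list, the correct_name helper and the 0.7 cutoff are assumptions made for illustration, not something from your data.

```python
from difflib import get_close_matches

# Hypothetical list of expected names; in practice it would come from your own records.
KNOWN_NAMES = ["John", "Carl", "Jerry"]

def correct_name(word, cutoff=0.7):
    """Return the closest known name, or the word unchanged if nothing is close enough."""
    matches = get_close_matches(word, KNOWN_NAMES, n=1, cutoff=cutoff)
    return matches[0] if matches else word

# Misspelled names are mapped to the known spelling; unrelated words pass through.
print([correct_name(w) for w in ["Jhon", "Carl", "Jeery", "Google"]])
```

A higher cutoff makes the correction more conservative, which matters if the text contains words that merely resemble a name.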
With a corrected sentence and the spaCy library, you can get the correct output:
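import spacy

# 'en' is the shortcut name from older spaCy releases;
# spaCy 3+ needs a full model name such as 'en_core_web_sm'.
nlp = spacy.load('en')
doc = nlp("Antacids is given to John, sodium bicarbonate is given to Carl, folic acid to Jerry all works at Google")

# Print index, character offset, coarse POS tag and text for every token.
for token in doc:
    print('token.i: {2}\ttoken.idx: {0}\ttoken.pos: {3:10}token.text: {1}'.
          format(token.idx, token.text, token.i, token.pos_))

# Print the named entities found by the model.
print('Entities', [(e.text, e.label_) for e in doc.ents])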
With the result (Antacids, sodium bicarbonate and acid are tagged as NOUN):
token.i: 0 token.idx: 0 token.pos: NOUN token.text: Antacids
token.i: 1 token.idx: 9 token.pos: VERB token.text: is
token.i: 2 token.idx: 12 token.pos: VERB token.text: given
token.i: 3 token.idx: 18 token.pos: ADP token.text: to
token.i: 4 token.idx: 21 token.pos: PROPN token.text: John
token.i: 5 token.idx: 25 token.pos: PUNCT token.text: ,
token.i: 6 token.idx: 27 token.pos: NOUN token.text: sodium
token.i: 7 token.idx: 34 token.pos: NOUN token.text: bicarbonate
token.i: 8 token.idx: 46 token.pos: VERB token.text: is
token.i: 9 token.idx: 49 token.pos: VERB token.text: given
token.i: 10 token.idx: 55 token.pos: ADP token.text: to
token.i: 11 token.idx: 58 token.pos: PROPN token.text: Carl
token.i: 12 token.idx: 62 token.pos: PUNCT token.text: ,
token.i: 13 token.idx: 64 token.pos: ADJ token.text: folic
token.i: 14 token.idx: 70 token.pos: NOUN token.text: acid
token.i: 15 token.idx: 75 token.pos: ADP token.text: to
token.i: 16 token.idx: 78 token.pos: PROPN token.text: Jerry
token.i: 17 token.idx: 84 token.pos: DET token.text: all
token.i: 18 token.idx: 88 token.pos: VERB token.text: works
token.i: 19 token.idx: 94 token.pos: ADP token.text: at
token.i: 20 token.idx: 97 token.pos: PROPN token.text: Google
And the entities are correctly labeled:
Entities [('John', 'PERSON'), ('Carl', 'PERSON'), ('Jerry', 'PERSON'), ('Google', 'ORG')]
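Note that the medicines get no entity label at all here, because a generic model has no medicine category. To put them in one category, one simple option is a dictionary (gazetteer) lookup on top of the NER output. The sketch below uses only the standard library; the MEDICINES list, the MEDICINE label and the tag_medicines helper are assumptions made for illustration, and a real list would have to come from a drug database.

```python
import re

# Hypothetical hand-written gazetteer; replace it with a real drug list.
MEDICINES = ["antacids", "antacid", "sodium bicarbonate", "folic acid"]

# Longest names first, so "antacids" is tried before "antacid".
_pattern = re.compile(
    r"\b(" + "|".join(re.escape(m) for m in sorted(MEDICINES, key=len, reverse=True)) + r")\b",
    re.IGNORECASE,
)

def tag_medicines(text):
    """Return (matched_text, 'MEDICINE') pairs in order of appearance."""
    return [(m.group(0), "MEDICINE") for m in _pattern.finditer(text)]

sentence = ("Antacids is given to John, sodium bicarbonate is given to Carl, "
            "folic acid to Jerry all works at Google")
print(tag_medicines(sentence))
```

The match is case-insensitive, so it also finds "Sodium Bicarbonate" as written in the question, but it only finds names that are in the list.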
Upvotes: 1