Meta

Reputation: 41

How can I assign my own custom tags to tokenized data in nltk?

When I run the code below:

from nltk import word_tokenize, pos_tag, ne_chunk


sentence = "Antacids is given to Jhon,Sodium Bicarbonate is given to Carl,Folic Acid to Jeery  all works at Google " 
print(ne_chunk(pos_tag(word_tokenize(sentence))))

I get this output:

(S
  (GPE Antacids/NNP)
  is/VBZ
  given/VBN
  to/TO
  (PERSON Jhon/NNP)
  ,/,
  (PERSON Sodium/NNP Bicarbonate/NNP)
  is/VBZ
  given/VBN
  to/TO
  (GPE Carl/NNP)
  ,/,
  (PERSON Folic/NNP Acid/NNP)
  to/TO
  (GPE Jeery/NNP)
  all/DT
  works/NNS
  at/IN
  (ORGANIZATION Google/NNP))

I want to assign the medicines (Antacids, Sodium Bicarbonate, Folic Acid) to the same category.

Which library can I use for this purpose?

Upvotes: 1

Views: 1205

Answers (1)

Jibsgrl

Reputation: 95

Do you want to keep the misspellings in your text? For example, Jhon instead of John, Jeery instead of Jerry, and capitalized common nouns (Sodium Bicarbonate would normally be written in lowercase, as sodium bicarbonate)?

The NER (Named Entity Recognition) models bundled with Python libraries are trained on clean text. Since your text contains misspellings, it will be hard to reach 100% accuracy with a generic NER model.

With a corrected sentence and the spacy library, you can get the correct output:

import spacy

# 'en' is a legacy shortcut; spaCy 3+ needs a full model name such as 'en_core_web_sm'
nlp = spacy.load('en')
doc = nlp("Antacids is given to John, sodium bicarbonate is given to Carl, folic acid to Jerry all works at Google")

for token in doc:
    print('token.i: {2}\ttoken.idx: {0}\ttoken.pos: {3:10}token.text: {1}'.
          format(token.idx, token.text, token.i, token.pos_))

print('Entities', [(e.text, e.label_) for e in doc.ents])

This gives the following result (Antacids, sodium, bicarbonate and acid are tagged as NOUN):

token.i: 0  token.idx: 0    token.pos: NOUN      token.text: Antacids
token.i: 1  token.idx: 9    token.pos: VERB      token.text: is
token.i: 2  token.idx: 12   token.pos: VERB      token.text: given
token.i: 3  token.idx: 18   token.pos: ADP       token.text: to
token.i: 4  token.idx: 21   token.pos: PROPN     token.text: John
token.i: 5  token.idx: 25   token.pos: PUNCT     token.text: ,
token.i: 6  token.idx: 27   token.pos: NOUN      token.text: sodium
token.i: 7  token.idx: 34   token.pos: NOUN      token.text: bicarbonate
token.i: 8  token.idx: 46   token.pos: VERB      token.text: is
token.i: 9  token.idx: 49   token.pos: VERB      token.text: given
token.i: 10 token.idx: 55   token.pos: ADP       token.text: to
token.i: 11 token.idx: 58   token.pos: PROPN     token.text: Carl
token.i: 12 token.idx: 62   token.pos: PUNCT     token.text: ,
token.i: 13 token.idx: 64   token.pos: ADJ       token.text: folic
token.i: 14 token.idx: 70   token.pos: NOUN      token.text: acid
token.i: 15 token.idx: 75   token.pos: ADP       token.text: to
token.i: 16 token.idx: 78   token.pos: PROPN     token.text: Jerry
token.i: 17 token.idx: 84   token.pos: DET       token.text: all
token.i: 18 token.idx: 88   token.pos: VERB      token.text: works
token.i: 19 token.idx: 94   token.pos: ADP       token.text: at
token.i: 20 token.idx: 97   token.pos: PROPN     token.text: Google

And the entities are correctly labeled:

Entities [('John', 'PERSON'), ('Carl', 'PERSON'), ('Jerry', 'PERSON'), ('Google', 'ORG')]
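
The medicines are still missing from that entity list, because a generic model has no label for them. One possible way to put them in a single category is a plain lookup against a list of known names. The following is only a minimal sketch under that assumption: the MEDICINES list, the MEDICINE label and the tag_medicines helper are hypothetical names made up for illustration, not part of nltk or spacy, and it uses only the standard library.

```python
import re

# Hypothetical gazetteer: extend it with the medicine names you care about.
MEDICINES = ["antacids", "antacid", "sodium bicarbonate", "folic acid"]

# Try longer names first so a multi-word name wins over a shorter overlap.
_pattern = re.compile(
    r"\b("
    + "|".join(re.escape(m) for m in sorted(MEDICINES, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,  # match "Sodium Bicarbonate" as well as "sodium bicarbonate"
)


def tag_medicines(text, label="MEDICINE"):
    """Return (matched text, label) pairs for every listed medicine found in text."""
    return [(m.group(0), label) for m in _pattern.finditer(text)]


sentence = ("Antacids is given to John, sodium bicarbonate is given to Carl, "
            "folic acid to Jerry all works at Google")
print(tag_medicines(sentence))
```

Because the match is case-insensitive and works on the raw string, it also finds the capitalized names in your original sentence. The resulting pairs can be appended to the entity list from the NER step. A lookup like this only finds names that are in the list, so misspelled medicine names would still need to be corrected or added first.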

Upvotes: 1
