Leena

Reputation: 11

How to tokenize new vocab in spaCy?

I am using spaCy for its dependency parsing, but I am having trouble getting the spaCy tokenizer to treat the new vocabulary entries I am adding as single tokens. This is my code:

import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.load("en_core_web_md")

nlp.vocab['bone morphogenetic protein (BMP)-2']

nlp.tokenizer = Tokenizer(nlp.vocab)

text = 'This study describes the distributions of bone morphogenetic protein (BMP)-2 as well as mRNAs for BMP receptor type IB (BMPRIB).'

doc = nlp(text)

print([(token.text,token.tag_) for token in doc])

Output:

[('This', 'DT'), ('study', 'NN'), ('describes', 'VBZ'), ('the', 'DT'), ('distributions', 'NNS'), ('of', 'IN'), ('bone', 'NN'), ('morphogenetic', 'JJ'), ('protein', 'NN'), ('(BMP)-2', 'NNP'), ('as', 'RB'), ('well', 'RB'), ('as', 'IN'), ('mRNAs', 'NNP'), ('for', 'IN'), ('BMP', 'NNP'), ('receptor', 'NN'), ('type', 'NN'), ('IB', 'NNP'), ('(BMPRIB).', 'NN')]

The desired output:

[('This', 'DT'), ('study', 'NN'), ('describes', 'VBZ'), ('the', 'DT'), ('distributions', 'NNS'), ('of', 'IN'), ('bone morphogenetic protein (BMP)-2', 'NN'), ('as', 'RB'), ('well', 'RB'), ('as', 'IN'), ('mRNAs', 'NN'), ('for', 'IN'), ('BMP receptor type IB', 'NNP'), ('(', '('), ('BMPRIB', 'NNP'), (')', ')'), ('.', '.')]

How can I make spaCy tokenize the new vocabulary entries I added?

Upvotes: 1

Views: 878

Answers (2)

Leena

Reputation: 11

I found the solution in nlp.tokenizer.tokens_from_list. I broke my sentence into a list of words, and spaCy then tokenized it as desired:

import spacy

nlp = spacy.load("en_core_web_sm")

nlp.tokenizer = nlp.tokenizer.tokens_from_list

for doc in nlp.pipe([['This', 'study', 'describes', 'the', 'distributions', 'of', 'bone morphogenetic protein (BMP)-2', 'as', 'well', 'as', 'mRNAs', 'for', 'BMP receptor type IB', '(', 'BMPRIB', ')', '.']]):
    # Each doc keeps the words exactly as they were split in the input list
    for token in doc:
        print(token, '//', token.dep_)
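For readers on newer spaCy releases: tokens_from_list was deprecated in spaCy v2 and, as far as I know, is no longer available in v3. A minimal sketch of the same idea, assuming spaCy v3, is to build the Doc directly from the pre-split words. The blank pipeline here is only an assumption to keep the sketch runnable without a model download; it produces no tags or dependencies.

```python
import spacy
from spacy.tokens import Doc

# Assumption: a blank English pipeline, so no trained model is required.
# Swap in spacy.load("en_core_web_sm") to get tags and dependency labels.
nlp = spacy.blank("en")

words = ['This', 'study', 'describes', 'the', 'distributions', 'of',
         'bone morphogenetic protein (BMP)-2', 'as', 'well', 'as', 'mRNAs',
         'for', 'BMP receptor type IB', '(', 'BMPRIB', ')', '.']

# Build the Doc from the pre-split words, bypassing the tokenizer entirely.
doc = Doc(nlp.vocab, words=words)

# Apply whatever components the pipeline has to the pre-tokenized Doc.
# For a blank pipeline this loop does nothing.
for _, component in nlp.pipeline:
    doc = component(doc)

print([token.text for token in doc])
```

With a trained pipeline loaded instead of the blank one, the same loop should run the tagger and parser over the custom tokens, though the model has never seen such multi-word tokens during training, so the predicted labels may be unreliable.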

Upvotes: 0

Sergey Bushmanov

Reputation: 25249

See if Doc.retokenize() may help you:

import spacy
nlp = spacy.load("en_core_web_md")
text = 'This study describes the distributions of bone morphogenetic protein (BMP)-2 as well as mRNAs for BMP receptor type IB (BMPRIB).'

doc = nlp(text)

with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[6:11])

print([(token.text,token.tag_) for token in doc])

[('This', 'DT'), ('study', 'NN'), ('describes', 'VBZ'), ('the', 'DT'), ('distributions', 'NNS'), ('of', 'IN'), ('bone morphogenetic protein (BMP)-2', 'NN'), ('as', 'RB'), ('well', 'RB'), ('as', 'IN'), ('mRNAs', 'NNP'), ('for', 'IN'), ('BMP', 'NNP'), ('receptor', 'NN'), ('type', 'NN'), ('IB', 'NNP'), ('(', '-LRB-'), ('BMPRIB', 'NNP'), (')', '-RRB-'), ('.', '.')]
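The slice doc[6:11] above is specific to this one sentence. One way to avoid hardcoding indices is to look the terms up with a PhraseMatcher and merge whatever it finds. The following is a rough sketch, assuming spaCy v3 (where PhraseMatcher.add takes a list of patterns); the terms list and the "TERMS" label are illustrative names, and a blank pipeline is used only so the sketch runs without a model download.

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.util import filter_spans

# Assumption: blank pipeline for a download-free sketch; use
# spacy.load("en_core_web_md") to also get tags and dependencies.
nlp = spacy.blank("en")

# Hypothetical list of multi-word terms to keep as single tokens.
terms = ["bone morphogenetic protein (BMP)-2", "BMP receptor type IB"]

# Patterns are tokenized with the same tokenizer as the text, so they line up.
matcher = PhraseMatcher(nlp.vocab)
matcher.add("TERMS", [nlp.make_doc(term) for term in terms])

text = ('This study describes the distributions of bone morphogenetic protein '
        '(BMP)-2 as well as mRNAs for BMP receptor type IB (BMPRIB).')
doc = nlp(text)

# Drop overlapping matches, then merge each remaining span into one token.
spans = filter_spans([doc[start:end] for _, start, end in matcher(doc)])
with doc.retokenize() as retokenizer:
    for span in spans:
        retokenizer.merge(span)

merged_tokens = [token.text for token in doc]
print(merged_tokens)
```

Because the merge happens after the pipeline has run, a trained tagger and parser would still have seen the original tokens, and the merged token takes its attributes from the span's root.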

Upvotes: 1
