Reputation: 11
I am using spaCy to benefit from its dependency parsing, but I am having trouble getting the spaCy tokenizer to tokenize the new vocab entries I am adding. This is my code:
import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.load("en_core_web_md")
nlp.vocab['bone morphogenetic protein (BMP)-2']
nlp.tokenizer = Tokenizer(nlp.vocab)

text = 'This study describes the distributions of bone morphogenetic protein (BMP)-2 as well as mRNAs for BMP receptor type IB (BMPRIB).'
doc = nlp(text)
print([(token.text, token.tag_) for token in doc])
Output:
[('This', 'DT'), ('study', 'NN'), ('describes', 'VBZ'), ('the', 'DT'), ('distributions', 'NNS'), ('of', 'IN'), ('bone', 'NN'), ('morphogenetic', 'JJ'), ('protein', 'NN'), ('(BMP)-2', 'NNP'), ('as', 'RB'), ('well', 'RB'), ('as', 'IN'), ('mRNAs', 'NNP'), ('for', 'IN'), ('BMP', 'NNP'), ('receptor', 'NN'), ('type', 'NN'), ('IB', 'NNP'), ('(BMPRIB).', 'NN')]
The desired output:
[('This', 'DT'), ('study', 'NN'), ('describes', 'VBZ'), ('the', 'DT'), ('distributions', 'NNS'), ('of', 'IN'), ('bone morphogenetic protein (BMP)-2', 'NN'), ('as', 'RB'), ('well', 'RB'), ('as', 'IN'), ('mRNAs', 'NN'), ('for', 'IN'), ('BMP receptor type IB', 'NNP'), ('(', '('), ('BMPRIB', 'NNP'), (')', ')'), ('.', '.')]
How can I make spaCy tokenize the new vocab entries I added?
Upvotes: 1
Views: 878
Reputation: 11
I found the solution in nlp.tokenizer.tokens_from_list. I broke my sentence into a list of words, and it was then tokenized as desired:
import spacy

nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = nlp.tokenizer.tokens_from_list

words = ['This', 'study', 'describes', 'the', 'distributions', 'of', 'bone morphogenetic protein (BMP)-2', 'as', 'well', 'as', 'mRNAs', 'for', 'BMP receptor type IB', '(', 'BMPRIB', ')', '.']
for doc in nlp.pipe([words]):
    for token in doc:
        print(token, '//', token.dep_)
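Note: tokens_from_list was removed in spaCy v3. If you are on a newer release, a roughly equivalent sketch (assuming spaCy v3, where nlp() also accepts a pre-built Doc) is to construct the Doc from the pre-split words yourself and then run the pipeline on it:
import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")
# pre-split words, with the multi-word terms kept as single items
words = ['This', 'study', 'describes', 'the', 'distributions', 'of', 'bone morphogenetic protein (BMP)-2', 'as', 'well', 'as', 'mRNAs', 'for', 'BMP receptor type IB', '(', 'BMPRIB', ')', '.']
# build the Doc directly from the word list, then run the remaining pipeline components
doc = nlp(Doc(nlp.vocab, words=words))
for token in doc:
    print(token, '//', token.dep_)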
Upvotes: 0
Reputation: 25249
See if Doc.retokenize() may help you:
import spacy

nlp = spacy.load("en_core_web_md")

text = 'This study describes the distributions of bone morphogenetic protein (BMP)-2 as well as mRNAs for BMP receptor type IB (BMPRIB).'
doc = nlp(text)

with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[6:11])

print([(token.text, token.tag_) for token in doc])
[('This', 'DT'), ('study', 'NN'), ('describes', 'VBZ'), ('the', 'DT'), ('distributions', 'NNS'), ('of', 'IN'), ('bone morphogenetic protein (BMP)-2', 'NN'), ('as', 'RB'), ('well', 'RB'), ('as', 'IN'), ('mRNAs', 'NNP'), ('for', 'IN'), ('BMP', 'NNP'), ('receptor', 'NN'), ('type', 'NN'), ('IB', 'NNP'), ('(', '-LRB-'), ('BMPRIB', 'NNP'), (')', '-RRB-'), ('.', '.')]
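If you don't want to hard-code the token indices, one option (a minimal sketch, assuming the spaCy v3 matcher API and an illustrative list of terms) is to locate the phrases with PhraseMatcher after parsing and merge each match with the retokenizer:
import spacy
from spacy.matcher import PhraseMatcher
from spacy.util import filter_spans

nlp = spacy.load("en_core_web_md")

# phrases to keep as single tokens (illustrative list)
terms = ["bone morphogenetic protein (BMP)-2", "BMP receptor type IB"]
matcher = PhraseMatcher(nlp.vocab)
matcher.add("TERMS", [nlp.make_doc(t) for t in terms])

text = 'This study describes the distributions of bone morphogenetic protein (BMP)-2 as well as mRNAs for BMP receptor type IB (BMPRIB).'
doc = nlp(text)

# keep non-overlapping matches and merge each one into a single token
spans = filter_spans([doc[start:end] for _, start, end in matcher(doc)])
with doc.retokenize() as retokenizer:
    for span in spans:
        retokenizer.merge(span)

print([(token.text, token.tag_) for token in doc])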
Upvotes: 1