Reputation: 35
I am working on a chatbot project using NLP. I am using spaCy and I want to get the POS tags of the tokens in a sentence. Currently I am using this code:
import spacy

en = spacy.load("en_core_web_md")
pos_sent = "lib/lzma.py this module provides classes and convenience functions for compressing and decompressing data using the lzma compression algorithm."
pos_sent = en(pos_sent)
for token in pos_sent:
    print(token, token.pos_)
But this also splits tokens on symbols, which I don't want. For example, it treats "lib", "/", "lzma.py" as separate tokens, but in the original sentence it is one whole word. Is there some way I can get the POS of the word without it being split on symbols?
Upvotes: 1
Views: 490
Reputation: 3174
Well, your text is not really natural language / full sentences, so the model doesn't know what to do with that path and treats it as two words separated by a slash.
You can add a special-case rule to the tokenizer or create a custom tokenizer class in spaCy; see https://spacy.io/usage/linguistic-features#special-cases and https://spacy.io/usage/linguistic-features#native-tokenizers. It might get a bit tricky with paths/URLs, though.
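As a minimal sketch of the special-case route (using the path from your question; every string you want to protect needs its own rule):

import spacy
from spacy.symbols import ORTH

en = spacy.load("en_core_web_md")

# Keep this exact string as a single token; special cases match whole
# whitespace-delimited substrings, so the slash and dot are fine here.
en.tokenizer.add_special_case("lib/lzma.py", [{ORTH: "lib/lzma.py"}])

doc = en("lib/lzma.py this module provides classes for compressing data.")
for token in doc:
    print(token.text, token.pos_)

Now "lib/lzma.py" comes back as one token instead of three. For arbitrary paths you would need the custom-tokenizer route instead, e.g. a regex hooked into the tokenizer's token_match, which is why it can get tricky.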
Or you do the tokenization entirely outside of spaCy (see @Stef's comment) and then hand pretokenized sentences to spaCy, as sketched below. If you only need tokenization and part-of-speech tagging, you should also check other frameworks/methods like NLTK to see if they handle it more the way you would like, as each model does this based on how it was trained.
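A minimal sketch of the pretokenized route, assuming the word list comes from your own tokenizer (the words here are illustrative):

import spacy
from spacy.tokens import Doc

en = spacy.load("en_core_web_md")

# Pretokenized input, e.g. from your own splitter
words = ["lib/lzma.py", "provides", "classes", "for", "compressing", "data", "."]

# Build a Doc directly from the words, bypassing spaCy's tokenizer,
# then run the remaining pipeline components (tagger etc.) over it.
doc = Doc(en.vocab, words=words)
for _, component in en.pipeline:
    doc = component(doc)

for token in doc:
    print(token.text, token.pos_)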
Also, if you are using spaCy for a chatbot (in real time), you should disable the components you don't need to speed up processing.
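For example (the component names "parser" and "ner" are the usual ones but can differ per model/version; check en.pipe_names):

import spacy

# Load the model without the components not needed for POS tagging
en = spacy.load("en_core_web_md", disable=["parser", "ner"])
print(en.pipe_names)  # what is left, e.g. the tagger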
Upvotes: 1