Reputation: 865
I have a list of Stack Overflow tags: [javascript, node.js, c++, amazon-s3, ...].
I want to tokenize a Stack Overflow question: "what do I prefer ? javascript, node.js, c++ or amazon-S3 ? This is dummy."
and I want NLTK to tokenize 'node.js' as a single token, "node.js", not as 'node' and 'js'.
How can I tell NLTK not to split a word if it is in my tag list?
I have read this possible duplicate, and the question seems to be the same, but the answer based on the Multi-Word Expression Tokenizer (MWETokenizer) doesn't meet my need.
If I use that solution, I think I would have to rebuild every tag manually, piece by piece, for example:
tokenizer = nltk.tokenize.MWETokenizer()
# add_mwe() takes a single tuple holding the pieces of the expression
tokenizer.add_mwe(('Python', '-', '3', '.', 'x'))
What I need is for every existing tag to be treated as "untokenizable".
Upvotes: 1
Views: 371
Reputation: 160
I don't know the full range of tags that you're looking to retain as whole tokens, but it seems that NLTK's basic word_tokenize() function will preserve those particular items as tokens, without any tag list defined.
import nltk
sentence = "what do I prefer ? javascript, node.js, c++ or amazon-S3 ? This is dummy."
tokens = nltk.word_tokenize(sentence)
print(tokens)
Output:
['what', 'do', 'I', 'prefer', '?', 'javascript', ',', 'node.js', ',', 'c++', 'or', 'amazon-S3', '?', 'This', 'is', 'dummy', '.']
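If some tags in your real list do get split, one possible fallback is to protect the tag list yourself with a regular expression. The sketch below uses only Python's standard `re` module, not an NLTK feature, and the names `TAGS`, `build_tag_aware_pattern` and `tokenize` are made up for illustration. It assumes tags should match case-insensitively and only when they are not glued to other word characters.

```python
import re

# Hypothetical tag list, taken from the question; extend as needed.
TAGS = ["javascript", "node.js", "c++", "amazon-s3"]

# Fallback when no tag matches: a run of word characters,
# or a single punctuation character.
FALLBACK = r"\w+|[^\w\s]"


def build_tag_aware_pattern(tags):
    """Compile a pattern that tries the known tags before the fallback."""
    if not tags:
        return re.compile(FALLBACK)
    # Longest tags first, so "c++" is tried before a shorter tag such as "c".
    ordered = sorted(tags, key=len, reverse=True)
    tag_alternatives = "|".join(re.escape(tag) for tag in ordered)
    # The lookarounds stop a tag from matching inside a longer word.
    return re.compile(
        r"(?<!\w)(?:" + tag_alternatives + r")(?!\w)|" + FALLBACK,
        re.IGNORECASE,
    )


def tokenize(text, tags=TAGS):
    """Return the tokens of text, keeping every known tag in one piece."""
    return build_tag_aware_pattern(tags).findall(text)


if __name__ == "__main__":
    sentence = ("what do I prefer ? javascript, node.js, c++ or amazon-S3 ? "
                "This is dummy.")
    print(tokenize(sentence))
```

On the sentence from the question this gives the same token list as the word_tokenize() output above. A tag such as "python-3.x" also stays whole once it is added to the list, so nothing has to be rebuilt piece by piece.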
Upvotes: 1