How to prevent NLTK from splitting specific words?

Question

I have a list of Stack Overflow tags: [javascript, node.js, c++, amazon-s3, ...].

I want to tokenize a Stack Overflow question: "what do I prefer ? javascript, node.js, c++ or amazon-S3 ? This is dummy."

and I want NLTK to keep 'node.js' as a single token, "node.js", rather than splitting it into 'node' and 'js'.

How can I tell NLTK not to split a word if it is in my tag list?

I have read this possible duplicate, and the question seems to be the same, but the answer, which is based on the Multi-Word Expression Tokenizer (MWETokenizer), doesn't meet my need.

In fact, if I use this solution, I think I'll have to manually rebuild every tag from its pieces, for example:

import nltk
tokenizer = nltk.tokenize.MWETokenizer()
# add_mwe() takes a single tuple holding the pieces of the expression
tokenizer.add_mwe(('Python', '-', '3', '.', 'x'))

What I need is for every existing tag to be treated as "untokenizable".

Isaac B · Accepted Answer

I don't know the full range of tags that you're looking to retain as whole tokens, but it seems that NLTK's basic word_tokenize() function will preserve those particular items as tokens, without any tag list defined.

import nltk
sentence = "what do I prefer ? javascript, node.js, c++ or amazon-S3 ? This is dummy."
tokens = nltk.word_tokenize(sentence)
print(tokens)

Output:

['what', 'do', 'I', 'prefer', '?', 'javascript', ',', 'node.js', ',', 'c++', 'or', 'amazon-S3', '?', 'This', 'is', 'dummy', '.']
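
If a tag in your list does get split by the default tokenizer, one possible fallback is to build a regular expression straight from the tag list, so that each tag is matched as a whole before any generic splitting happens. The following is only a minimal sketch using Python's standard `re` module rather than an NLTK feature. The helper name `build_tag_tokenizer` is made up for this illustration, and the fallback rule (runs of word characters, or single punctuation marks) is an assumption that is much cruder than what `word_tokenize()` does.

```python
import re

def build_tag_tokenizer(tags):
    """Return a function that tokenizes text while keeping known tags whole.

    Hypothetical helper for illustration; not part of NLTK.
    """
    # Try longer tags first so a long tag wins over a shorter one it contains.
    ordered = sorted(set(tags), key=len, reverse=True)
    alternatives = []
    if ordered:
        escaped = "|".join(re.escape(tag) for tag in ordered)
        # Only accept a tag that is not glued to other word characters.
        alternatives.append(rf"(?<!\w)(?:{escaped})(?!\w)")
    # Assumed fallback: a run of word characters, or one punctuation mark.
    alternatives.append(r"\w+")
    alternatives.append(r"[^\w\s]")
    # IGNORECASE lets the tag "amazon-s3" match "amazon-S3" in the text.
    pattern = re.compile("|".join(alternatives), re.IGNORECASE)
    return pattern.findall

tags = ["javascript", "node.js", "c++", "amazon-s3"]
tokenize = build_tag_tokenizer(tags)
sentence = "what do I prefer ? javascript, node.js, c++ or amazon-S3 ? This is dummy."
print(tokenize(sentence))
# ['what', 'do', 'I', 'prefer', '?', 'javascript', ',', 'node.js', ',', 'c++',
#  'or', 'amazon-S3', '?', 'This', 'is', 'dummy', '.']
```

Because the tags are escaped and placed ahead of the generic rules, nothing has to be rebuilt piece by piece as with add_mwe(). The trade-off is that contractions, abbreviations and similar cases are not handled the way word_tokenize() handles them, so this is probably best kept for text where the tags are what matters.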
