Reputation: 304
What is the difference between tokenization and segmentation in NLP? I searched for both terms but couldn't find a clear distinction.
Upvotes: 9
Views: 2414
Reputation: 1376
Short answer: All tokenization is segmentation, but not all segmentation is tokenization.
Long answer:
Segmentation is the more general concept of splitting input text into parts. Tokenization is one type of segmentation, carried out according to well-defined criteria.
For example, in a hypothetical scenario where every input sentence is a compound sentence made of two sub-sentences, splitting each one into two independent sentences can be called segmentation (but not tokenization).
Tokenization is a form of segmentation performed on the basis of a semantic criterion or a token dictionary, e.g. word or sub-word tokenization, mainly with the intention of assigning token ids to the resulting pieces for downstream processing.
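To make the hypothetical scenario concrete, here is a minimal sketch of that kind of segmentation. The function name `split_compound` and the splitting rule (a comma followed by "and" or "but") are assumptions chosen for illustration; real clause segmentation would need a parser or a trained model.

```python
import re

def split_compound(sentence: str) -> list[str]:
    """Split a compound sentence into its two clauses.

    Assumed, simplified rule: the clauses are joined by ", and" or ", but".
    """
    # Drop the trailing period, then split once on the conjunction.
    body = sentence.rstrip(".")
    parts = re.split(r",\s+(?:and|but)\s+", body, maxsplit=1)
    return [p.strip() for p in parts]

segments = split_compound("The model converged, but the loss stayed high.")
print(segments)  # ['The model converged', 'the loss stayed high']
```

The output is two text segments. Nothing here maps the pieces to a vocabulary or to ids, which is why this counts as segmentation rather than tokenization.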
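As a rough illustration of tokenization followed by id assignment, the sketch below uses a simple regex word tokenizer and a vocabulary built on the fly. The helper names (`tokenize`, `build_vocab`, `encode`) and the `<unk>` convention are assumptions for this example, not the API of any particular library; production systems typically use sub-word schemes with a fixed, pre-trained vocabulary.

```python
import re

def tokenize(text: str) -> list[str]:
    # Lowercase, then extract runs of word characters or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text.lower())

def build_vocab(tokens: list[str]) -> dict[str, int]:
    # Reserve id 0 for unknown tokens; assign the rest in order of first appearance.
    vocab = {"<unk>": 0}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def encode(tokens: list[str], vocab: dict[str, int]) -> list[int]:
    # Map each token to its id, falling back to the <unk> id.
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

tokens = tokenize("The cat sat on the mat.")
vocab = build_vocab(tokens)
ids = encode(tokens, vocab)

print(tokens)  # ['the', 'cat', 'sat', 'on', 'the', 'mat', '.']
print(ids)     # [1, 2, 3, 4, 1, 5, 6]
```

Encoding a sentence with an unseen word, e.g. `encode(tokenize("The dog sat."), vocab)`, gives `[1, 0, 3, 6]`, since "dog" falls back to the `<unk>` id.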
Upvotes: 4