Reputation: 913
Basically the title: is there any equivalent to keras.preprocessing.text.Tokenizer
in PyTorch? I have yet to find one that provides all of those utilities without hand-crafting things myself.
Upvotes: 5
Views: 5678
Reputation: 3109
I find Torchtext more difficult to use for simple things. PyTorch-NLP can do this in a more straightforward way:
from torchnlp.encoders.text import StaticTokenizerEncoder, stack_and_pad_tensors, pad_tensor
loaded_data = ["now this ain't funny", "so don't you dare laugh"]
encoder = StaticTokenizerEncoder(loaded_data, tokenize=lambda s: s.split())
encoded_data = [encoder.encode(example) for example in loaded_data]
print(encoded_data)
[tensor([5, 6, 7, 8]), tensor([ 9, 10, 11, 12, 13])]
encoded_data = [pad_tensor(x, length=10) for x in encoded_data]
print(stack_and_pad_tensors(encoded_data))
# alternatively, use encoder.batch_encode()
BatchedSequences(tensor=tensor([[ 5, 6, 7, 8, 0, 0, 0, 0, 0, 0], [ 9, 10, 11, 12, 13, 0, 0, 0, 0, 0]]), lengths=tensor([10, 10]))
It comes with other types of encoders as well, such as a spaCy-backed tokenizer and a subword encoder.
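For instance, a minimal sketch of the spaCy-backed encoder (this assumes spaCy is installed; like StaticTokenizerEncoder, it builds its vocabulary from the sample you pass in):

from torchnlp.encoders.text import SpacyEncoder

loaded_data = ["now this ain't funny", "so don't you dare laugh"]
# builds a vocabulary from the sample, tokenizing with spaCy
encoder = SpacyEncoder(loaded_data)
print(encoder.encode("you dare laugh"))  # tensor of token indices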
Upvotes: 4
Reputation: 11213
PyTorch itself does not provide a function like this; you either need to do it manually (which should be easy: use a tokenizer of your choice and do a dictionary lookup for the indices) or use a library.
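A minimal sketch of the manual route (whitespace tokenization and a hand-built index here are just stand-ins for whatever tokenizer and vocabulary you actually use):

import torch

sentences = ["now this ain't funny", "so don't you dare laugh"]
# build the vocabulary; index 0 is reserved for padding
vocab = {"<pad>": 0}
for sentence in sentences:
    for token in sentence.split():
        vocab.setdefault(token, len(vocab))
# tokenize and do the dictionary lookup
encoded = [torch.tensor([vocab[t] for t in s.split()]) for s in sentences]
# pad to a common length
padded = torch.nn.utils.rnn.pad_sequence(encoded, batch_first=True, padding_value=0)
print(padded)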
Alternatively, you can use Torchtext, which provides basic abstractions for text processing. All you need to do is create a Field object. You can use string.split, SpaCy, or a custom function for tokenization. You can provide a vocabulary or build it directly from the data. Then you just call the process method, which pads the examples and does the vocabulary lookup (the tokenization itself happens in preprocess).
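A minimal sketch against the legacy Field API (torchtext before 0.9, where Field lived in torchtext.data; the names below follow that API):

from torchtext.data import Field

sentences = ["now this ain't funny", "so don't you dare laugh"]
field = Field(tokenize=str.split)  # tokenize can also be 'spacy' or a custom callable
tokenized = [field.preprocess(s) for s in sentences]  # tokenization
field.build_vocab(tokenized)                          # vocabulary built from the data
print(field.process(tokenized))                       # padding + vocabulary lookup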
If you want something more complex, you might also consider AllenNLP, where tokenization and vocabulary lookup are done as separate steps.
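A minimal sketch of that separation (this assumes an AllenNLP 1.x-style API, where WhitespaceTokenizer is available):

from allennlp.data.tokenizers import WhitespaceTokenizer
from allennlp.data.vocabulary import Vocabulary

tokenizer = WhitespaceTokenizer()
tokens = tokenizer.tokenize("so don't you dare laugh")  # step 1: tokenization
vocab = Vocabulary()
for token in tokens:
    vocab.add_token_to_namespace(token.text)
# step 2: vocabulary lookup, done separately from tokenization
ids = [vocab.get_token_index(token.text) for token in tokens]
print(ids)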
Upvotes: 2