katiex7

Reputation: 913

keras.preprocessing.text.Tokenizer equivalent in Pytorch?

Basically the title; is there any equivalent to keras.preprocessing.text.Tokenizer in PyTorch? I have yet to find one that provides all of its utilities without hand-crafting things.
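
For reference, the Keras workflow I am trying to replicate looks like this (standard keras.preprocessing.text API):

from keras.preprocessing.text import Tokenizer

texts = ["now this ain't funny", "so don't you dare laugh"]

tokenizer = Tokenizer()        # builds a word index, handles filtering, lowercasing, etc.
tokenizer.fit_on_texts(texts)  # learn the vocabulary from the corpus
print(tokenizer.texts_to_sequences(texts))  # e.g. [[1, 2, 3, 4], [5, 6, 7, 8, 9]]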

Upvotes: 5

Views: 5678

Answers (2)

Feng Mai

Reputation: 3109

I find Torchtext more difficult to use for simple things. PyTorch-NLP can do this in a more straightforward way:

from torchnlp.encoders.text import StaticTokenizerEncoder, stack_and_pad_tensors, pad_tensor

loaded_data = ["now this ain't funny", "so don't you dare laugh"]
encoder = StaticTokenizerEncoder(loaded_data, tokenize=lambda s: s.split())  # builds the vocabulary from the corpus
encoded_data = [encoder.encode(example) for example in loaded_data]  # token -> index lookup

print(encoded_data)

[tensor([5, 6, 7, 8]), tensor([ 9, 10, 11, 12, 13])]

encoded_data = [pad_tensor(x, length=10) for x in encoded_data]
print(stack_and_pad_tensors(encoded_data))
# alternatively, use encoder.batch_encode()

BatchedSequences(tensor=tensor([[ 5,  6,  7,  8,  0,  0,  0,  0,  0,  0],
        [ 9, 10, 11, 12, 13,  0,  0,  0,  0,  0]]), lengths=tensor([10, 10]))

It comes with other types of encoders as well, such as a spaCy-based tokenizer and a subword encoder.
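
For example, a minimal sketch using SpacyEncoder from the same package (class name as of torchnlp 0.5; requires spaCy to be installed):

from torchnlp.encoders.text import SpacyEncoder

loaded_data = ["now this ain't funny", "so don't you dare laugh"]
encoder = SpacyEncoder(loaded_data)  # tokenizes with spaCy instead of str.split
print(encoder.encode("now this ain't funny"))  # a 1-D LongTensor of token indices; note spaCy splits contractions like "ain't"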

Upvotes: 4

Jindřich

Reputation: 11213

PyTorch itself does not provide a function like this; you need to do it manually, which should be easy: use a tokenizer of your choice and do a dictionary lookup for the indices.
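
A minimal sketch of that manual approach (all names here are illustrative, not a fixed API):

from collections import Counter
import torch

texts = ["now this ain't funny", "so don't you dare laugh"]

# 1. Tokenize with a tokenizer of your choice (plain split here).
tokenized = [t.split() for t in texts]

# 2. Build the vocabulary; reserve 0 for padding and 1 for unknown tokens.
counter = Counter(tok for sent in tokenized for tok in sent)
vocab = {tok: i + 2 for i, (tok, _) in enumerate(counter.most_common())}

# 3. Dictionary lookup for the indices.
encoded = [torch.tensor([vocab.get(tok, 1) for tok in sent]) for sent in tokenized]
print(encoded)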

Alternatively, you can use Torchtext, which provides basic abstractions for text processing. All you need to do is create a Field object. You can use string.split, spaCy, or a custom function for tokenization. You can provide a vocabulary or build it directly from your data. Then you just call the process method, which does the padding and the vocabulary lookup (the tokenization itself happens in preprocess).
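
A minimal sketch of that workflow, assuming the legacy Field API (torchtext.data.Field; moved to torchtext.legacy.data in torchtext 0.9+):

from torchtext.data import Field

texts = ["now this ain't funny", "so don't you dare laugh"]

field = Field(tokenize=str.split)                 # or tokenize='spacy', or any callable
tokenized = [field.preprocess(t) for t in texts]  # tokenization step
field.build_vocab(tokenized)                      # build the vocabulary from the data
print(field.process(tokenized))                   # pads and does the vocabulary lookup; LongTensor of shape (max_len, batch_size)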

If you want something more complex, you might also consider AllenNLP. In AllenNLP, the tokenization and the vocabulary lookup are done as separate steps.
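
A minimal sketch of those two separate steps, assuming an AllenNLP 1.x/2.x release (class names have varied across versions):

from allennlp.data import Vocabulary
from allennlp.data.tokenizers import WhitespaceTokenizer

texts = ["now this ain't funny", "so don't you dare laugh"]

# Step 1: tokenization.
tokenizer = WhitespaceTokenizer()
tokenized = [tokenizer.tokenize(t) for t in texts]

# Step 2: vocabulary lookup, done separately.
vocab = Vocabulary()
for sent in tokenized:
    for token in sent:
        vocab.add_token_to_namespace(token.text)
encoded = [[vocab.get_token_index(token.text) for token in sent] for sent in tokenized]
print(encoded)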

Upvotes: 2
