Reputation: 913
Basically the title: is there any equivalent to keras.preprocessing.text.Tokenizer
in PyTorch? I have yet to find one that provides all of those utilities without hand-crafting things myself.
Upvotes: 5
Views: 5678
Reputation: 3109
I find Torchtext more difficult to use for simple things. PyTorch-NLP can do this in a more straightforward way:
from torchnlp.encoders.text import StaticTokenizerEncoder, stack_and_pad_tensors, pad_tensor
loaded_data = ["now this ain't funny", "so don't you dare laugh"]
encoder = StaticTokenizerEncoder(loaded_data, tokenize=lambda s: s.split())
encoded_data = [encoder.encode(example) for example in loaded_data]
print(encoded_data)
[tensor([5, 6, 7, 8]), tensor([ 9, 10, 11, 12, 13])]
encoded_data = [pad_tensor(x, length=10) for x in encoded_data]
print(stack_and_pad_tensors(encoded_data))
# alternatively, use encoder.batch_encode()
BatchedSequences(tensor=tensor([[ 5, 6, 7, 8, 0, 0, 0, 0, 0, 0], [ 9, 10, 11, 12, 13, 0, 0, 0, 0, 0]]), lengths=tensor([10, 10]))
It comes with other types of encoders as well, such as a spaCy-backed tokenizer and a subword encoder.
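For instance, a minimal sketch of the spaCy-backed encoder (this assumes spaCy is installed; like StaticTokenizerEncoder, it builds its vocabulary from the sample you pass in):

from torchnlp.encoders.text import SpacyEncoder

loaded_data = ["now this ain't funny", "so don't you dare laugh"]
# builds a vocabulary from the sample, tokenizing with spaCy
encoder = SpacyEncoder(loaded_data)
print(encoder.encode("you dare laugh"))  # tensor of token indices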
Upvotes: 4
Reputation: 11213
PyTorch itself does not provide a function like this; you either need to do it manually (which should be easy: use a tokenizer of your choice and do a dictionary lookup for the indices) or use a library.
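A minimal sketch of the manual route (whitespace tokenization and a hand-built index here are just stand-ins for whatever tokenizer and vocabulary you actually use):

import torch

sentences = ["now this ain't funny", "so don't you dare laugh"]
# build the vocabulary; index 0 is reserved for padding
vocab = {"<pad>": 0}
for sentence in sentences:
    for token in sentence.split():
        vocab.setdefault(token, len(vocab))
# tokenize and do the dictionary lookup
encoded = [torch.tensor([vocab[t] for t in s.split()]) for s in sentences]
# pad to a common length
padded = torch.nn.utils.rnn.pad_sequence(encoded, batch_first=True, padding_value=0)
print(padded)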
Alternatively, you can use Torchtext, which provides basic abstractions for text processing. All you need to do is create a Field object. You can use string.split, SpaCy, or a custom function for tokenization. You can provide a vocabulary or build it directly from the data. Then you just call the process method, which pads the examples and does the vocabulary lookup (the tokenization itself happens in preprocess).
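A minimal sketch against the legacy Field API (torchtext before 0.9, where Field lived in torchtext.data; the names below follow that API):

from torchtext.data import Field

sentences = ["now this ain't funny", "so don't you dare laugh"]
field = Field(tokenize=str.split)  # tokenize can also be 'spacy' or a custom callable
tokenized = [field.preprocess(s) for s in sentences]  # tokenization
field.build_vocab(tokenized)                          # vocabulary built from the data
print(field.process(tokenized))                       # padding + vocabulary lookup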
If you want something more complex, you might also consider AllenNLP, where tokenization and vocabulary lookup are done as separate steps.
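A minimal sketch of that separation (this assumes an AllenNLP 1.x-style API, where WhitespaceTokenizer is available):

from allennlp.data.tokenizers import WhitespaceTokenizer
from allennlp.data.vocabulary import Vocabulary

tokenizer = WhitespaceTokenizer()
tokens = tokenizer.tokenize("so don't you dare laugh")  # step 1: tokenization
vocab = Vocabulary()
for token in tokens:
    vocab.add_token_to_namespace(token.text)
# step 2: vocabulary lookup, done separately from tokenization
ids = [vocab.get_token_index(token.text) for token in tokens]
print(ids)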
Upvotes: 2