Torchtext vocab getting tokens from index

from torchtext.vocab import Vocab
from collections import Counter

def create_vocab(file_, tokenizer):
   counter_dict = Counter()
   for sentence in file_:
      counter_dict.update(tokenizer(sentence))

   return Vocab(counter_dict)



vocab = create_vocab(w_data, tokenizer)

vocab.lookup_indices([1,2,3])

When I try to look up the tokens for these indices, I get:

AttributeError: 'Counter' object has no attribute 'lookup_indices'

Upvotes: 1

Answers (1)

user2314737

Reputation: 29317

lookup_indices takes as input a list of strings (tokens) and returns a list of integers (their indices in the Vocab):

lookup_indices(tokens: List[str]) → List[int]

To get the tokens corresponding to indices [1,2,3] use lookup_tokens instead.

Yes, the naming is a bit confusing: one might expect lookup_indices to take indices and lookup_tokens to take tokens, but it's the other way around. Each method is named for what it returns.

Here's a small example:

from collections import Counter, OrderedDict
from torchtext.vocab import vocab

c = Counter(["a", "a", "b", "b", "b", "c"])
# The vocab() factory expects an OrderedDict mapping tokens to frequencies;
# see https://pytorch.org/text/stable/vocab.html
ordered_dict = OrderedDict(sorted(c.items(), key=lambda x: x[1], reverse=True))

myVocabulary = vocab(ordered_dict)
print("myVocabulary is of type {}".format(type(myVocabulary)))
# myVocabulary is of type <class 'torchtext.vocab.vocab.Vocab'>
print(myVocabulary['a'], myVocabulary['b']) 
# 1 0
print(myVocabulary.lookup_indices(['a', 'b'])) 
# [1, 0]
print(myVocabulary.lookup_tokens([0, 1])) 
# ['b', 'a']
print(myVocabulary.lookup_indices(['a', 'b', 'c'])) 
# [1, 0, 2]
print(myVocabulary['c']) 
# 2
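
To see why the names read this way, it can help to model the two mappings without torchtext at all. The following is only a minimal sketch, not torchtext's implementation: `TinyVocab`, `itos` and `stoi` are hypothetical names chosen for illustration, and the most frequent token is assumed to get index 0, as in the example above.

```python
from collections import Counter


class TinyVocab:
    """Hypothetical stand-in for a vocabulary; not torchtext's implementation."""

    def __init__(self, counter):
        # Index-to-token list: most frequent token first (assumption for this sketch).
        self.itos = [tok for tok, _ in counter.most_common()]
        # Token-to-index dict: the inverse of the list above.
        self.stoi = {tok: idx for idx, tok in enumerate(self.itos)}

    def lookup_indices(self, tokens):
        # Takes tokens (strings), returns their indices (ints).
        return [self.stoi[tok] for tok in tokens]

    def lookup_tokens(self, indices):
        # Takes indices (ints), returns the corresponding tokens (strings).
        return [self.itos[idx] for idx in indices]


tiny = TinyVocab(Counter(["a", "a", "b", "b", "b", "c"]))
print(tiny.lookup_indices(["a", "b", "c"]))
# [1, 0, 2]
print(tiny.lookup_tokens([0, 1, 2]))
# ['b', 'a', 'c']
```

Reading each method name as "look up the X for these inputs" gives the right direction: lookup_tokens([1, 2, 3]) is the call that turns indices back into tokens.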

Upvotes: 1
