Reputation: 1541
Using the tutorials here, I wrote the following code:
from transformers import GPT2Tokenizer, GPT2Model
import torch
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state
I understand that "inputs" contains the tokenized items of my sentence. But how can I get the tokens themselves as strings (for example ["hello", ",", "my", "dog", "is", "cute"])?
I am asking because I think the tokenizer sometimes splits a word when that word is not in its vocabulary (e.g., a word from another language), and I want to check for that in my code.
Upvotes: 1
Views: 1821
Reputation: 432
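You can call tokenizer.decode on each id in the tokenizer's output to get the corresponding entry from its vocabulary. Each id is decoded separately, so a word that was split into several pieces shows up as several entries: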
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> list(map(tokenizer.decode, inputs.input_ids[0]))
['Hello', ',', ' my', ' dog', ' is', ' cute']
Upvotes: 3