Reputation: 6756
Recently I posted this question and tried to solve my problem. My questions are:

1. For the two sentences (['New Delhi is the capital of India', 'The capital of India is Delhi']), even if I add [CLS] and [SEP] tokens, the lengths are 9 and 8. The max_seq_len parameter is 10, so why are the last rows of x1 and x2 not the same?
2. I set segment_ids to 0 for all words in a paragraph. Is that correct?
3. I found the tokenization.py file, but I don't see vocab.txt. I see the file 30k-clean.vocab. Could I use 30k-clean.vocab instead of vocab.txt?

Upvotes: 1
Views: 801
Reputation: 7369
import tokenization

tokenizer = tokenization.FullTokenizer(vocab_file=<PATH to Vocab file>, do_lower_case=True)
tokens = tokenizer.tokenize(example.text_a)  # example.text_a is one of your raw input sentences
print(tokens)
This should give you the word-piece tokenized list, without the [CLS] and [SEP] tokens.
Generally, word-piece tokenization splits words that are not in the vocabulary, so the token list can be longer than the number of input words.
You can pass both sentences together, provided that the length of the paragraph after word-piece tokenization does not exceed max_seq_length; see the sketch below.
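For instance, the usual BERT/ALBERT convention for packing a sentence pair into one example looks roughly like this (a minimal sketch reusing the tokenizer built above; max_seq_len here is an arbitrary illustrative value):

# Sketch: pack two sentences as [CLS] A [SEP] B [SEP], then zero-pad
text_a = 'New Delhi is the capital of India'
text_b = 'The capital of India is Delhi'
max_seq_len = 24  # illustrative value, large enough for both piece lists

tokens_a = tokenizer.tokenize(text_a)
tokens_b = tokenizer.tokenize(text_b)

tokens = ['[CLS]'] + tokens_a + ['[SEP]'] + tokens_b + ['[SEP]']
segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)

input_ids = tokenizer.convert_tokens_to_ids(tokens)
input_mask = [1] * len(input_ids)

# zero-pad everything up to max_seq_len
while len(input_ids) < max_seq_len:
    input_ids.append(0)
    input_mask.append(0)
    segment_ids.append(0)

print(tokens)
print(input_ids)
print(segment_ids)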
The vocab file for ALBERT is ./data/vocab.txt, provided you have got the ALBERT code from here.
In case you have got the model from TF-Hub, the file is 2/assets/30k-clean.vocab
Upvotes: 1
Reputation: 426
@user2543622, you may refer to the official code here; in your case, you can do something like:
import tensorflow as tf  # TF1-style session API is used below
import tensorflow_hub as hub

albert_module = hub.Module("https://tfhub.dev/google/albert_base/2", trainable=True)
print(albert_module.get_signature_names())  # should output ['tokens', 'tokenization_info', 'mlm']

# then
tokenization_info = albert_module(signature="tokenization_info", as_dict=True)
with tf.Session() as sess:
    vocab_file, do_lower_case = sess.run([tokenization_info["vocab_file"],
                                          tokenization_info["do_lower_case"]])
print(vocab_file)  # output b'/var/folders/v6/vnz79w0d2dn95fj0mtnqs27m0000gn/T/tfhub_modules/098d91f064a4f53dffc7633d00c3d8e87f3a4716/assets/30k-clean.model'
I guess this vocab_file is a binary SentencePiece model file, so you should use this one for tokenization, as below, instead of using the 30k-clean.vocab.
# you still need the tokenization.py code to perform full tokenization
return tokenization.FullTokenizer(
    vocab_file=vocab_file, do_lower_case=do_lower_case,
    spm_model_file=FLAGS.spm_model_file)
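Putting it together, you could build and use the tokenizer directly from the values fetched from the hub module above. This is a sketch, assuming tokenization.py from the ALBERT repo is on your path; note that sess.run returns bytes, so the path may need decoding:

import tokenization

spm_path = vocab_file.decode('utf-8')  # sess.run returned a bytes path
tokenizer = tokenization.FullTokenizer(
    vocab_file=spm_path, do_lower_case=do_lower_case,
    spm_model_file=spm_path)  # the .model file drives SentencePiece tokenization

tokens = tokenizer.tokenize('New Delhi is the capital of India')
print(tokens)
print(tokenizer.convert_tokens_to_ids(tokens))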
If you only need the embedding matrix values, you can take a look at albert_module.variable_map, e.g.:
print(albert_module.variable_map['bert/embeddings/word_embeddings'])
# <tf.Variable 'module/bert/embeddings/word_embeddings:0' shape=(30000, 128) dtype=float32>
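To read actual values out of that variable, a TF1 session sketch could look like the following (the token ids are arbitrary, chosen only for illustration, and the initializer pattern assumes the same session-style setup as above):

# Sketch: fetch a few embedding rows from the word-embedding variable
embeddings = albert_module.variable_map['bert/embeddings/word_embeddings']
some_ids = tf.constant([2, 3, 4])  # arbitrary token ids for illustration
some_rows = tf.nn.embedding_lookup(embeddings, some_ids)

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    print(sess.run(some_rows).shape)  # (3, 128)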
Upvotes: 2