user2543622

Reputation: 6756

tensorflow_hub to pull BERT embedding on Windows machine - extending to ALBERT

Recently I posted this question and tried to solve my problem. My questions are:

  1. Is my approach correct?
  2. My example sentences have lengths 7 and 6 respectively - (['New Delhi is the capital of India', 'The capital of India is Delhi']). Even if I add the [CLS] and [SEP] tokens, the lengths are 9 and 8. The max_seq_len parameter is 10, so why are the last rows of x1 and x2 not the same?
  3. How do I get embeddings when I have a paragraph of more than 2 sentences? Do I have to pass one sentence at a time? But in that case won't I lose information, since I am not passing all the sentences together?
    • I did some additional research and it seems that I can pass the entire paragraph as a single sentence, using segment_ids of 0 for all words in the paragraph. Is that correct?
  4. How do I get embeddings for ALBERT? I see that ALBERT also has a tokenization.py file, but I don't see vocab.txt. I see a file 30k-clean.vocab. Could I use 30k-clean.vocab instead of vocab.txt?

Upvotes: 1

Views: 801

Answers (2)

Ashwin Geet D'Sa

Reputation: 7369

  1. Your approach seems right.
  2. Could you please check the tokenization of sentences 1 and 2 using the tokenizer? This should reveal whether there are additional word pieces in one of the sentences. This can be checked as below:
import tokenization
tokenizer = tokenization.FullTokenizer(vocab_file=<PATH to vocab file>, do_lower_case=True)
tokens = tokenizer.tokenize(example.text_a)  # word-piece tokens for one input sentence
print(tokens)

This should give you the word-piece tokenized list, without the [CLS] and [SEP] tokens.

Generally, word-piece tokenization splits a word into sub-word pieces when it is not in the vocabulary, which produces more tokens than the number of input words.
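
For example, a quick check (a sketch, assuming the tokenizer built above and the two sentences from the question) is to compare the word-piece token lists and their lengths directly:

tokens_1 = tokenizer.tokenize('New Delhi is the capital of India')
tokens_2 = tokenizer.tokenize('The capital of India is Delhi')
print(len(tokens_1), tokens_1)  # a word outside the vocab shows up as several '##' pieces
print(len(tokens_2), tokens_2)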

  3. You can pass both sentences together, provided that the length of the paragraph after word-piece tokenization does not exceed the max_seq_len (see the sketch after this list).

  4. The vocab file for ALBERT is in the ./data/vocab.txt directory, provided you have got the ALBERT code from: here. If you got the model from tf-hub, the file is 2/assets/30k-clean.vocab
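
As a sketch for point 3 (assuming the same tokenizer as above; the max_seq_len value here is only illustrative), the whole paragraph can be fed as one sequence with a single segment, so every token gets segment id 0:

max_seq_len = 16
paragraph = 'New Delhi is the capital of India. The capital of India is Delhi.'
tokens = ['[CLS]'] + tokenizer.tokenize(paragraph)[:max_seq_len - 2] + ['[SEP]']
input_ids = tokenizer.convert_tokens_to_ids(tokens)
input_mask = [1] * len(input_ids)
segment_ids = [0] * len(input_ids)  # one segment: all ids are 0
# pad everything up to max_seq_len
pad_len = max_seq_len - len(input_ids)
input_ids += [0] * pad_len
input_mask += [0] * pad_len
segment_ids += [0] * pad_len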

Upvotes: 1

Arron Cao

Reputation: 426

@user2543622, you may refer to the official code here; in your case, you can do something like:

import tensorflow as tf
import tensorflow_hub as hub
albert_module = hub.Module("https://tfhub.dev/google/albert_base/2", trainable=True)
print(albert_module.get_signature_names()) # should output ['tokens', 'tokenization_info', 'mlm']
# then 
tokenization_info = albert_module(signature="tokenization_info",
                                  as_dict=True)
with tf.Session() as sess:
  vocab_file, do_lower_case = sess.run([tokenization_info["vocab_file"],
                                        tokenization_info["do_lower_case"]])
print(vocab_file) # output b'/var/folders/v6/vnz79w0d2dn95fj0mtnqs27m0000gn/T/tfhub_modules/098d91f064a4f53dffc7633d00c3d8e87f3a4716/assets/30k-clean.model'

I guess this vocab_file is a binary sentencepiece model file, so you should use this one for tokenization as below, instead of using the 30k-clean.vocab.

# you still need the tokenization.py code from the ALBERT repo to perform full tokenization
import tokenization
tokenizer = tokenization.FullTokenizer(
  vocab_file=vocab_file, do_lower_case=do_lower_case,
  spm_model_file=vocab_file)  # the hub module's vocab_file is the sentencepiece model
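
As a quick usage check (a sketch, assuming the tokenizer built above), you can tokenize one of the question's sentences and map the pieces to vocabulary ids:

tokens = tokenizer.tokenize('The capital of India is Delhi')
ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens)
print(ids)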

If you only need the embedding matrix values, you can take a look at albert_module.variable_map, e.g.:

print(albert_module.variable_map['bert/embeddings/word_embeddings'])
# <tf.Variable 'module/bert/embeddings/word_embeddings:0' shape=(30000, 128) dtype=float32>
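
If you need the actual matrix values rather than the variable, a minimal sketch (using the same TF1 session workflow as above, which restores the pretrained values when the variables are initialized) is:

embedding_var = albert_module.variable_map['bert/embeddings/word_embeddings']
with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  embedding_matrix = sess.run(embedding_var)  # numpy array of shape (30000, 128)
print(embedding_matrix.shape)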

Upvotes: 2
