Reputation: 33
I'm using a TextVectorization Layer in a TF Keras Sequential model. I need to convert the intermediate TextVectorization layer's output to plain text. I've found that there is no direct way to accomplish this. So I used the TextVectorization layer's vocabulary to inverse transform the vectors. The code is as follows:
import numpy as np
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
text_list = np.array(["this is the first sentence.","second line of the dataset."]) # an array of 2 sentences
textvectorizer = TextVectorization(max_tokens=None,
                                   standardize=None,
                                   ngrams=None,
                                   output_mode="int",
                                   output_sequence_length=None,
                                   pad_to_max_tokens=False)
textvectorizer.adapt(text_list)
vectors = textvectorizer(text_list)
vectors
Vectors:
array([[ 3, 7, 2, 9, 4],
[ 5, 6, 8, 2, 10]])
Now, I want to convert the vectors to sentences.
my_vocab = textvectorizer.get_vocabulary()
plain_text_list = []
for line in vectors:
    text = ' '.join(my_vocab[idx] for idx in line)
    plain_text_list.append(text)
print(plain_text_list)
Output:
['this is the first sentence.', 'second line of the dataset.']
This produces the desired result. However, because it loops over every token in Python, it is extremely slow on a large dataset. How can I reduce the execution time of this method?
Upvotes: 1
Views: 431
Reputation: 26708
Maybe try np.vectorize:
import numpy as np
my_vocab = textvectorizer.get_vocabulary()
index_vocab = dict(zip(np.arange(len(my_vocab)), my_vocab))
print(np.vectorize(index_vocab.get)(vectors))
[['this' 'is' 'the' 'first' 'sentence.']
['second' 'line' 'of' 'the' 'dataset.']]
Performance test:
import numpy as np
import timeit
my_vocab = textvectorizer.get_vocabulary()
def method1(my_vocab, vectors):
    index_vocab = dict(zip(np.arange(len(my_vocab)), my_vocab))
    return np.vectorize(index_vocab.get)(vectors)

def method2(my_vocab, vectors):
    plain_text_list = []
    for line in vectors:
        text = ' '.join(my_vocab[idx] for idx in line)
        plain_text_list.append(text)
    return plain_text_list
t1 = timeit.Timer(lambda: method1(my_vocab, vectors))
t2 = timeit.Timer(lambda: method2(my_vocab, vectors))
print(t1.timeit(5000))
print(t2.timeit(5000))
0.3139600929998778
19.671524284000043
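Note that method1 returns an array of tokens, while method2 returns joined sentences, so the two timings are not quite like for like. If you need sentences, another option worth trying is plain NumPy integer-array indexing, since np.vectorize is essentially a Python loop under the hood. The following is only a sketch under some assumptions: decode is a hypothetical helper name, and my_vocab and vectors are hard-coded stand-ins reconstructed from the outputs shown in the question. In real code you would use textvectorizer.get_vocabulary() and vectors.numpy() instead. I have not benchmarked it against the numbers above.

```python
import numpy as np

# Stand-in for textvectorizer.get_vocabulary(), reconstructed from the
# vectors and sentences shown in the question (assumption). Index 0 is
# the padding token '' and index 1 is '[UNK]'.
my_vocab = ["", "[UNK]", "the", "this", "sentence.", "second",
            "line", "is", "of", "first", "dataset."]

# Stand-in for vectors.numpy() (assumption).
vectors = np.array([[3, 7, 2, 9, 4],
                    [5, 6, 8, 2, 10]])

def decode(vocab, ids):
    """Map a 2-D array of token ids back to a list of sentences."""
    vocab_arr = np.asarray(vocab)
    # Integer-array indexing looks up every id in one NumPy call.
    tokens = vocab_arr[ids]
    # Join each row, skipping empty padding tokens.
    return [" ".join(t for t in row if t) for row in tokens]

print(decode(my_vocab, vectors))
# ['this is the first sentence.', 'second line of the dataset.']
```

The lookup itself happens in a single indexing operation; only the final join per sentence stays in Python. Skipping empty tokens matters once output_sequence_length pads shorter sentences with index 0.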
Upvotes: 1