Reputation: 801
I am trying to create a custom standardize function for the TextVectorization layer in TensorFlow 2.1, but I seem to be getting something fundamentally wrong.
I have the following text data:
import numpy as np
my_array = np.array([
    "I am a sentence.",
    "I am another sentence!"
])
I basically want to lowercase the text, remove punctuation, and remove certain words.
The default standardize option of the TextVectorization layer (LOWER_AND_STRIP_PUNCTUATION) lowercases and removes punctuation, but as far as I know there is no way to remove whole words.
(If you know a way to do so, an alternative approach to mine as described below is of course also very much appreciated.)
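To make the goal concrete, here is the mapping I am after, sketched in plain Python (pure illustration, not TensorFlow code; the bad_words list is the same one I use further below):

import string

bad_words = ["i", "am"]

def desired_output(sentence):
    # lowercase, strip punctuation, then drop the unwanted words
    cleaned = sentence.lower().translate(str.maketrans('', '', string.punctuation))
    return " ".join(w for w in cleaned.split() if w not in bad_words)

print(desired_output("I am a sentence."))        # -> a sentence
print(desired_output("I am another sentence!"))  # -> another sentence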
First, here is an example of a working custom standardization function from the TensorFlow documentation:
import re
import string
import tensorflow as tf

def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
    return tf.strings.regex_replace(stripped_html,
                                    '[%s]' % re.escape(string.punctuation), '')
When I pass it to the TextVectorization layer and adapt it on my_array, it works just fine:
vectorize_layer_1 = TextVectorization(
    output_mode='int',
    standardize=custom_standardization,
)

vectorize_layer_1.adapt(my_array)  # no error
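As a quick sanity check, I can inspect the result (the exact token ids depend on the adapted vocabulary, so the printed values are only illustrative):

print(vectorize_layer_1.get_vocabulary())  # learned vocabulary, most frequent tokens first
print(vectorize_layer_1(my_array))         # integer-encoded sentences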
However, my custom standardization keeps raising an error. Here is my code:
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
from tensorflow.keras.preprocessing.text import text_to_word_sequence

my_array = np.array([
    "I am a sentence",
    "I am another sentence"
])

# these words should be removed
bad_words = ["i", "am"]

def remove_words(tokens):
    return [word for word in tokens if word not in bad_words]

# this is the normalization function I want to apply
def my_custom_normalize(my_array):
    tokenized = [text_to_word_sequence(str(sentence)) for sentence in my_array]
    clean_texts = [" ".join(remove_words(tokenized_string))
                   for tokenized_string in tokenized]
    clean_tensor = tf.convert_to_tensor(clean_texts)
    return clean_tensor

my_vectorize_layer = TextVectorization(
    output_mode='int',
    standardize=my_custom_normalize,
)
However, once I try adapting, I keep running into an error:
my_vectorize_layer.adapt(my_array) # raises error
InvalidArgumentError: Tried to squeeze dim index 1 for tensor with 1 dimensions. [Op:Squeeze]
And I really do not understand why. In the documentation it says:
When using a custom callable for standardize, the data received by the callable will be exactly as passed to this layer. The callable should return a tensor of the same shape as the input.
I thought maybe that's what is causing the error, but when I look at the shapes, everything seems correct:
my_result = my_custom_normalize(my_array)
my_result.shape # returns TensorShape([2])
working_result = custom_standardization(my_array)
working_result.shape # returns TensorShape([2])
I am really lost here. What am I doing wrong? Am I not supposed to use list comprehensions?
Upvotes: 2
Views: 3146
Reputation: 260
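One way to remove whole words without leaving TensorFlow string ops is to extend the documented custom_standardization with per-word regex_replace calls. In the sketch below, stopwords_eng is assumed to be a list of stop words, e.g. NLTK's stopwords.words('english'):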
import re
import string
import tensorflow as tf

# stopwords_eng is assumed to be defined elsewhere, e.g. NLTK's stopwords.words('english')
def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
    # remove numbers, including decimals and scientific notation
    stripped_html = tf.strings.regex_replace(stripped_html, r'\d+(?:\.\d*)?(?:[eE][+-]?\d+)?', ' ')
    # remove @mentions
    stripped_html = tf.strings.regex_replace(stripped_html, r'@([A-Za-z0-9_]+)', ' ')
    # remove whole stop words (matches only words with a space on both sides)
    for i in stopwords_eng:
        stripped_html = tf.strings.regex_replace(stripped_html, f' {i} ', " ")
    return tf.strings.regex_replace(
        stripped_html, "[%s]" % re.escape(string.punctuation), ""
    )
Upvotes: 1