bk_

Reputation: 801

Tensorflow TextVectorization layer: How to define a custom standardize function?

I am trying to create a custom standardize function for the TextVectorization layer in TensorFlow 2.1, but I seem to be getting something fundamentally wrong.

I have the following text data:

import numpy as np

my_array = np.array([
    "I am a sentence.",
    "I am another sentence!"
])

My Goal

I basically want to lowercase the text, remove punctuation, and remove some words. The default standardize function of the TextVectorization layer (LOWER_AND_STRIP_PUNCTUATION) lowercases and removes punctuation, but as far as I know there is no way to remove whole words.

(If you know a way to do so, an alternative approach to the one described below is of course also very much appreciated.)
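
For reference, a minimal sketch of removing whole words with tf.strings ops alone (not from the original post; the \b word-boundary regex and the example word list are assumptions, and this is essentially the approach the answer below takes):

import re
import string
import tensorflow as tf

bad_words = ["i", "am"]  # example words to drop

def standardize_drop_words(input_data):
    lowered = tf.strings.lower(input_data)
    for word in bad_words:
        # tf.strings.regex_replace uses RE2, which supports \b word boundaries
        lowered = tf.strings.regex_replace(lowered, r"\b%s\b" % re.escape(word), "")
    # finally strip punctuation, as the default standardizer does
    return tf.strings.regex_replace(lowered, "[%s]" % re.escape(string.punctuation), "")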


An example that is working

First, here is an example of a working custom standardization function from the TensorFlow documentation:

import re
import string
import tensorflow as tf

def custom_standardization(input_data):
  lowercase = tf.strings.lower(input_data)
  stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
  return tf.strings.regex_replace(stripped_html,
                                  '[%s]' % re.escape(string.punctuation), '')

When I pass it to the TextVectorization layer and adapt it on my_array, it works just fine:

vectorize_layer_1 = TextVectorization(
    output_mode='int',
    standardize=custom_standardization,
    )

vectorize_layer_1.adapt(my_array)  # no error
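
As a quick sanity check (added here for illustration; get_vocabulary() is part of the TextVectorization API), the adapted layer's vocabulary and output can be inspected:

print(vectorize_layer_1.get_vocabulary())
# lowercased, punctuation-free tokens such as 'sentence', 'i', 'am', 'a', 'another'
# (plus padding/OOV entries; exact ordering may differ across TF versions)
print(vectorize_layer_1(my_array))  # integer-encoded sentences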

The custom function that is not working

However, my custom standardization keeps raising an error. Here is my code:

import numpy as np
import tensorflow as tf
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
from tensorflow.keras.preprocessing.text import text_to_word_sequence

my_array = np.array([
    "I am a sentence",
    "I am another sentence"
])

# these words should be removed
bad_words = ["i", "am"]

def remove_words(tokens):
    return [word for word in tokens if word not in bad_words]

# this is the normalization function I want to apply
def my_custom_normalize(my_array):
    tokenized = [text_to_word_sequence(str(sentence)) for sentence in my_array]
    clean_texts = [" ".join(remove_words(tokenized_string))
                     for tokenized_string
                     in tokenized]
    clean_tensor = tf.convert_to_tensor(clean_texts)
    return clean_tensor
    
my_vectorize_layer = TextVectorization(
    output_mode='int',
    standardize=my_custom_normalize,
    )

However, once I try adapting, I keep running into an error:

my_vectorize_layer.adapt(my_array)  # raises error
InvalidArgumentError: Tried to squeeze dim index 1 for tensor with 1 dimensions. [Op:Squeeze]

And I really do not understand why. In the documentation it says:

When using a custom callable for standardize, the data received by the callable will be exactly as passed to this layer. The callable should return a tensor of the same shape as the input.

I thought maybe that's what is causing the error, but when I look at the shapes, everything seems correct:

my_result = my_custom_normalize(my_array)
my_result.shape  # returns TensorShape([2])
working_result = custom_standardization(my_array)
working_result.shape # returns TensorShape([2])

I am really lost here. What am I doing wrong? Am I not supposed to use list comprehensions?
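
One detail worth probing (an editor's sketch, not part of the original question): inside adapt the callable typically receives a tf.Tensor rather than a NumPy array of Python strings, and str() on a string tensor yields its repr, not the raw text:

# pass a tf.Tensor, which is closer to what the layer hands the callable;
# str(sentence) now stringifies the tensor repr ("tf.Tensor(b'I am a sentence', ...)"),
# so text_to_word_sequence tokenizes the repr instead of the sentence
probe = my_custom_normalize(tf.constant(my_array))
print(probe)

The exact Squeeze error depends on how the layer wraps the callable internally, but this already suggests that Python-level string handling (str, list comprehensions, text_to_word_sequence) cannot be relied on inside standardize; tf.strings ops, as in the answer below, stay graph-compatible.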

Upvotes: 2

Views: 3146

Answers (1)

Extender

Reputation: 260

import re
import string
import tensorflow as tf

def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    # strip HTML line breaks
    stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
    # drop numbers, including decimals and scientific notation
    stripped_html = tf.strings.regex_replace(stripped_html, r'\d+(?:\.\d*)?(?:[eE][+-]?\d+)?', ' ')
    # drop @mentions
    stripped_html = tf.strings.regex_replace(stripped_html, r'@([A-Za-z0-9_]+)', ' ')
    # drop stopwords; stopwords_eng is assumed to be defined (see the sketch below)
    # note: ' {i} ' only matches words surrounded by spaces
    for i in stopwords_eng:
        stripped_html = tf.strings.regex_replace(stripped_html, f' {i} ', " ")
    return tf.strings.regex_replace(
        stripped_html, "[%s]" % re.escape(string.punctuation), ""
    )
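
For this to run, stopwords_eng needs to be defined; a minimal sketch (the word list is only an example; in practice it might come from NLTK's stopwords.words('english')):

# example stopword list; any list of lowercase words works here
stopwords_eng = ["i", "am", "the", "a", "an"]

vectorize_layer = TextVectorization(
    output_mode="int",
    standardize=custom_standardization,
)
vectorize_layer.adapt(my_array)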

Upvotes: 1
