Preprocessing text for siamese network

Question

I want to create a siamese network to compare similarity of two strings.

I am trying to follow this tutorial. This example works with images, but I want to work with string representations (at character level) and I am stuck with preprocessing of text.

Let's imagine that I have two inputs:

string_a = ["one","two","three"]
string_b = ["four","five","six"]

And I need to prepare it for input of my model. To do so I need to:

create a tokenizer
create a tf dataframe
preprocess this dataframe (tokenize inputs)

So I am trying the following:

    import tensorflow as tf
    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    #create a tokenizer
    tok = Tokenizer(char_level=True,oov_token="?")
    tok.fit_on_texts(string_a+string_b)
    char_index = tok.word_index
    maxlen = max([len(x) for x in tok.texts_to_sequences(string_a+string_b)])
    
    #create a dataframe
    dataset_a = tf.data.Dataset.from_tensor_slices(string_a)
    dataset_b = tf.data.Dataset.from_tensor_slices(string_b)
    
    dataset = tf.data.Dataset.zip((dataset_a,dataset_b))
    
    # preprocessing functions
    def tokenize_string(data,tokenizer,max_len):
        """vectorize string with a given tokenizer
        """
        sequence = tokenizer.texts_to_sequences(data)
        return_seq = pad_sequences(sequence,maxlen=max_len,padding="post",truncating="post")
        return return_seq[0]
    
    def preprocess_couple(string_1,string_2):
        """given 2 strings, tokenize them and return an array
        """
        return (
            tokenize_string([string_1], tok, maxlen),
            tokenize_string([string_2], tok, maxlen)
        )
    
    #shuffle and preprocess dataset
    dataset = dataset.shuffle(buffer_size=2)
    dataset = dataset.map(preprocess_couple)

However I get an error:

AttributeError: in user code:

    :29 preprocess_couple  *
        tokenize_string([string_2], tok, maxlen)
    :20 tokenize_string  *
        sequence = tokenizer.texts_to_sequences(data)
    C:\HOMEWARE\Miniconda3-Windows-x86_64\envs\embargo_text\lib\site-packages\keras_preprocessing	ext.py:281 texts_to_sequences  *
        return list(self.texts_to_sequences_generator(texts))
    C:\HOMEWARE\Miniconda3-Windows-x86_64\envs\embargo_text\lib\site-packages\keras_preprocessing	ext.py:306 texts_to_sequences_generator  **
        text = text.lower()
    C:\HOMEWARE\Miniconda3-Windows-x86_64\envs\embargo_text\lib\site-packages	ensorflow\python\framework\ops.py:401 __getattr__
        self.__getattribute__(name)

The dataset state before application of preprocess_couple function is the following:

(, )
(, )
(, )

I think this error comes from the fact that the strings are transformed to tensors by function from_tensor_slices. But what is the proper way to preprocess this data for inputs?

Preprocessing text for siamese network

Answers (1)

Related Questions