Reputation: 140
I want to create a siamese network to compare the similarity of two strings.
I am trying to follow this tutorial. The example works with images, but I want to work with character-level string representations, and I am stuck on preprocessing the text.
Let's imagine that I have two inputs:
string_a = ["one","two","three"]
string_b = ["four","five","six"]
And I need to prepare it for input of my model. To do so I need to:
So I am trying the following:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
#create a tokenizer
tok = Tokenizer(char_level=True,oov_token="?")
tok.fit_on_texts(string_a+string_b)
char_index = tok.word_index
maxlen = max([len(x) for x in tok.texts_to_sequences(string_a+string_b)])
# create a dataset of string pairs
dataset_a = tf.data.Dataset.from_tensor_slices(string_a)
dataset_b = tf.data.Dataset.from_tensor_slices(string_b)
dataset = tf.data.Dataset.zip((dataset_a,dataset_b))
# preprocessing functions
def tokenize_string(data, tokenizer, max_len):
    """vectorize string with a given tokenizer
    """
    sequence = tokenizer.texts_to_sequences(data)
    return_seq = pad_sequences(sequence, maxlen=max_len, padding="post", truncating="post")
    return return_seq[0]

def preprocess_couple(string_1, string_2):
    """given 2 strings, tokenize them and return an array
    """
    return (
        tokenize_string([string_1], tok, maxlen),
        tokenize_string([string_2], tok, maxlen)
    )
#shuffle and preprocess dataset
dataset = dataset.shuffle(buffer_size=2)
dataset = dataset.map(preprocess_couple)
However I get an error:
AttributeError: in user code:
    <ipython-input-29-b920d389ea82>:29 preprocess_couple *
        tokenize_string([string_2], tok, maxlen)
    <ipython-input-29-b920d389ea82>:20 tokenize_string *
        sequence = tokenizer.texts_to_sequences(data)
    C:\HOMEWARE\Miniconda3-Windows-x86_64\envs\embargo_text\lib\site-packages\keras_preprocessing\text.py:281 texts_to_sequences *
        return list(self.texts_to_sequences_generator(texts))
    C:\HOMEWARE\Miniconda3-Windows-x86_64\envs\embargo_text\lib\site-packages\keras_preprocessing\text.py:306 texts_to_sequences_generator **
        text = text.lower()
    C:\HOMEWARE\Miniconda3-Windows-x86_64\envs\embargo_text\lib\site-packages\tensorflow\python\framework\ops.py:401 __getattr__
        self.__getattribute__(name)
The dataset state before application of preprocess_couple function is the following:
(<tf.Tensor: shape=(), dtype=string, numpy=b'two'>, <tf.Tensor: shape=(), dtype=string, numpy=b'five'>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'three'>, <tf.Tensor: shape=(), dtype=string, numpy=b'six'>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'one'>, <tf.Tensor: shape=(), dtype=string, numpy=b'four'>)
I think this error occurs because from_tensor_slices converts the strings to tensors, so the tokenizer no longer receives Python strings. What is the proper way to preprocess this data as model input?
Upvotes: 0
Views: 244
Reputation: 59
I am not sure what you actually want to achieve, but if you want to convert your text to vectors, this will help:
def process(data):
    tok = Tokenizer(char_level=True, oov_token="?")
    tok.fit_on_texts(data)
    maxlen = max([len(x) for x in tok.texts_to_sequences(data)])
    data = tok.texts_to_sequences(data)
    data = pad_sequences(data, maxlen=maxlen, padding='post')
    return data
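On the error itself: Dataset.map traces the mapped function as a graph, so string_1 and string_2 arrive as symbolic tensors, and the Tokenizer's Python code cannot call .lower() on them. One way around this is to tokenize and pad before building the dataset, so that from_tensor_slices only ever sees integer arrays. Below is a minimal sketch of that idea, assuming the data fits in memory; the names padded_a and padded_b are mine, and everything else comes from the question.

```python
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

string_a = ["one", "two", "three"]
string_b = ["four", "five", "six"]

# Fit one char-level tokenizer on both lists so they share a vocabulary.
tok = Tokenizer(char_level=True, oov_token="?")
tok.fit_on_texts(string_a + string_b)

# Tokenize and pad in plain Python, before anything becomes a tensor.
seq_a = tok.texts_to_sequences(string_a)
seq_b = tok.texts_to_sequences(string_b)
maxlen = max(len(s) for s in seq_a + seq_b)
padded_a = pad_sequences(seq_a, maxlen=maxlen, padding="post", truncating="post")
padded_b = pad_sequences(seq_b, maxlen=maxlen, padding="post", truncating="post")

# Build the paired dataset from the integer arrays, then shuffle it.
dataset = tf.data.Dataset.from_tensor_slices((padded_a, padded_b))
dataset = dataset.shuffle(buffer_size=len(string_a))
```

Each element of dataset is then a pair of integer vectors of length maxlen, one per branch of the network. If the tokenization really has to happen inside map, wrapping the Python function with tf.py_function is another option, at the cost of running that step outside the graph.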
Upvotes: 0