Nicolas Gervais

Reputation: 36594

How do I preprocess and tokenize a TensorFlow CsvDataset inside the map method?

I made a TensorFlow CsvDataset, and I'm trying to tokenize the data as such:

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
from tensorflow import keras
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
os.chdir('/home/nicolas/Documents/Datasets')

fname = 'rotten_tomatoes_reviews.csv'


def preprocess(target, inputs):
    tok = Tokenizer(num_words=5_000, lower=True)
    tok.fit_on_texts(inputs)
    vectors = tok.texts_to_sequences(inputs)
    return vectors, target


dataset = tf.data.experimental.CsvDataset(filenames=fname,
                                          record_defaults=[tf.int32, tf.string],
                                          header=True).map(preprocess)

Running this gives the following error:

ValueError: len requires a non-scalar tensor, got one of shape Tensor("Shape:0", shape=(0,), dtype=int32)

What I've tried: just about anything in the realm of possibilities. Note that everything runs if I remove the preprocessing step.

What the data looks like:

(<tf.Tensor: shape=(), dtype=int32, numpy=1>,
 <tf.Tensor: shape=(), dtype=string, numpy=b" Some movie critic review...">)

Upvotes: 2

Views: 2570

Answers (1)

today

Reputation: 33410

First of all, let's identify the problems in your code:

  • The first problem, which is also the reason behind the given error, is that the fit_on_texts method accepts a list of texts, not a single text string. Therefore, it should be: tok.fit_on_texts([inputs]).

  • After fixing that and running the code again, you would get another error: AttributeError: 'Tensor' object has no attribute 'lower'. This is because the elements of the dataset are Tensor objects, and the function passed to map must be able to handle them; however, the Tokenizer class is not designed to work on Tensor objects (there is a fix for this problem, but I won't address it now because of the next problem).

  • The biggest problem is that each time the map function, i.e. preprocess, is called, a new instance of the Tokenizer class is created and fit on a single text document. Update: As @Princy correctly pointed out in the comments section, the fit_on_texts method actually performs a partial fit (i.e. it updates or augments the internal vocabulary stats instead of starting from scratch). So if we create the Tokenizer instance outside the preprocess function, and assuming the vocabulary set is known beforehand (otherwise, you can't filter the most frequent words in a partial-fit scheme unless you have or build the vocabulary set first), then it would be possible to use this approach (i.e. based on the Tokenizer class) after applying the above fixes as well; a rough sketch of that variant is shown right after this list. However, personally, I prefer the solution further below.
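
For reference, here is a rough sketch of that Tokenizer-based variant (hypothetical code, not part of the original answer: it assumes you can afford a full eager pass over the dataset to fit the tokenizer first, and it uses the same tf.py_function trick that is explained further below):

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer

fname = "rotten_tomatoes_reviews.csv"
dataset = tf.data.experimental.CsvDataset(filenames=fname,
                                          record_defaults=[tf.int32, tf.string],
                                          header=True)

# Fit the tokenizer once, outside of `map`, by iterating the dataset eagerly.
tok = Tokenizer(num_words=5_000, lower=True)
for _, text in dataset:
    # `fit_on_texts` expects a list of (decoded) strings, not a Tensor.
    tok.fit_on_texts([text.numpy().decode("utf-8")])

def keras_encode(target, text):
    # `texts_to_sequences` also expects a list; take the first (and only) result.
    vector = tok.texts_to_sequences([text.numpy().decode("utf-8")])[0]
    return vector, target

def keras_encode_pyfn(target, text):
    # The Tokenizer can't run in graph mode, so wrap the call in `tf.py_function`.
    vector, target = tf.py_function(keras_encode,
                                    inp=[target, text],
                                    Tout=(tf.int32, tf.int32))
    vector.set_shape([None])
    target.set_shape([])
    return vector, target

dataset = dataset.map(keras_encode_pyfn)

The full pass used to fit the tokenizer plays the same role as building the vocabulary in the solution below.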


So, what should we do? As mentioned above, in almost all models that deal with text data, we first need to convert the texts into numerical features, i.e. encode them. To perform the encoding, we first need a vocabulary set or a dictionary of tokens. Therefore, the steps we should take are as follows:

  1. If there is a pre-built vocabulary available, then skip to the next step. Otherwise, tokenize all the text data first and build the vocabulary.

  2. Encode the text data using the vocabulary set.

For the first step, we use tfds.features.text.Tokenizer to tokenize the text data and build the vocabulary by iterating over the dataset.

For the second step, we use tfds.features.text.TokenTextEncoder to encode the text data using the vocabulary set built in the previous step. Note that for this step we are using the map method; however, since map only works in graph mode, we wrap our encode function in tf.py_function so that it can be used with map.

Here is the code (please read the comments in the code for additional points; I have not covered them in the answer text because they are not directly relevant, but they are useful and practical):

import tensorflow as tf
import tensorflow_datasets as tfds
from collections import Counter

fname = "rotten_tomatoes_reviews.csv"
dataset = tf.data.experimental.CsvDataset(filenames=fname,
                                          record_defaults=[tf.int32, tf.string],
                                          header=True)

# Create a tokenizer instance to tokenize text data.
tokenizer = tfds.features.text.Tokenizer()

# Find unique tokens in the dataset.
lowercase = True  # set this to `False` if case-sensitivity is important.
vocabulary = Counter()
for _, text in dataset:
    if lowercase:
        text = tf.strings.lower(text)
    tokens = tokenizer.tokenize(text.numpy())
    vocabulary.update(tokens)

# Select the most common tokens as final vocabulary set.
# Note: if you want all the tokens to be included,
# set `vocab_size = len(vocabulary)` instead.
vocab_size = 5000
vocabulary, _ = zip(*vocabulary.most_common(vocab_size))

# Create an encoder instance given our vocabulary set.
encoder = tfds.features.text.TokenTextEncoder(vocabulary,
                                              lowercase=lowercase,
                                              tokenizer=tokenizer)

# Set this to a non-zero integer if you want the texts
# to be truncated when they have more than `max_len` tokens.
max_len = None

def encode(target, text):
    text_encoded = encoder.encode(text.numpy())
    if max_len:
        text_encoded = text_encoded[:max_len]
    return text_encoded, target

# Wrap `encode` function inside `tf.py_function` so that
# it could be used with `map` method.
def encode_pyfn(target, text):
    text_encoded, target = tf.py_function(encode,
                                          inp=[target, text],
                                          Tout=(tf.int32, tf.int32))
    
    # (optional) Set the shapes for efficiency.
    text_encoded.set_shape([None])
    target.set_shape([])

    return text_encoded, target

# Apply encoding and then padding.
# Note: if you want the sequences in all the batches 
# to have the same length, set `padded_shapes` argument accordingly.
dataset = dataset.map(encode_pyfn).padded_batch(batch_size=3,
                                                padded_shapes=([None,], []))

# Important Note: probably this dataset would be used as input to a model
# which uses an Embedding layer. Therefore, don't forget that you
# should set the vocabulary size for this layer properly, i.e. the
# current value of `vocab_size` does not include the padding (added
# by `padded_batch` method) and also the OOV token (added by encoder).

Side note for future readers: notice that the order of the arguments, i.e. target, text, and their data types are based on the OP's dataset. Adapt them as needed for your own dataset/task (although at the end, i.e. return text_encoded, target, the order is adjusted to make it compatible with the format expected by the fit method).
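
For example, a minimal sketch of feeding this dataset to such a model might look like the following (hypothetical architecture, assuming a binary 0/1 target as in the OP's data; only the vocab_size + 2 embedding size follows from the note above):

import tensorflow as tf

model = tf.keras.Sequential([
    # +2 accounts for the padding index 0 (added by `padded_batch`)
    # and the OOV token (added by the encoder).
    tf.keras.layers.Embedding(input_dim=vocab_size + 2, output_dim=64, mask_zero=True),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(dataset, epochs=3)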

Upvotes: 6
