Lucas Azevedo

Reputation: 2370

ValueError: Can't convert non-rectangular Python sequence to Tensor when using tf.data.Dataset.from_tensor_slices

This issue has been posted a handful of times on SO, but I still can't figure out what the problem with my code is, especially because it comes from a Medium tutorial and the author makes the code available on Google Colab.

I have seen other users having problems with wrong variable types (#56304986), which is not my case, as my model input is the output of the tokenizer, and I have even seen the function I am trying to use (tf.data.Dataset.from_tensor_slices) suggested as a solution (#56304986).

The line yielding the error is:

# train dataset
ds_train_encoded = encode_examples(ds_train).shuffle(10000).batch(batch_size)

where the method encode_examples is defined as follows (I have inserted an assert line into encode_examples to be sure my problem was not mismatched lengths):

def encode_examples(ds, limit=-1):
    # prepare lists, so that we can build up the final TensorFlow dataset from slices
    input_ids_list = []
    token_type_ids_list = []
    attention_mask_list = []
    label_list = []
    if limit > 0:
        ds = ds.take(limit)

    for review, label in tfds.as_numpy(ds):
        bert_input = convert_example_to_feature(review.decode())

        ii = bert_input['input_ids']
        tti = bert_input['token_type_ids']
        am = bert_input['attention_mask']

        assert len(ii) == len(tti) == len(am), "mismatched lengths!"

        input_ids_list.append(ii)
        token_type_ids_list.append(tti)
        attention_mask_list.append(am)
        label_list.append([label])

    return tf.data.Dataset.from_tensor_slices(
        (input_ids_list, attention_mask_list, token_type_ids_list, label_list)
    ).map(map_example_to_dict)

The data is loaded like this (here I changed the dataset to take only 10% of the training data so I could speed up debugging):

(ds_train, ds_test), ds_info = tfds.load('imdb_reviews', split = ['train[:10%]','test[10%:15%]'], as_supervised=True, with_info=True)

The other two calls (convert_example_to_feature and map_example_to_dict) and the tokenizer are as follows:

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
def convert_example_to_feature(text):
    # combine step for tokenization, WordPiece vector mapping, adding special tokens as well as truncating reviews longer than the max length
    return tokenizer.encode_plus(text,
                                 add_special_tokens = True, # add [CLS], [SEP]
                                 #max_length = max_length, # max length of the text that can go to BERT
                                 pad_to_max_length = True, # add [PAD] tokens
                                 return_attention_mask = True,)# add attention mask to not focus on pad tokens

def map_example_to_dict(input_ids, attention_masks, token_type_ids, label):
    return ({"input_ids": input_ids,
            "token_type_ids": token_type_ids,
            "attention_mask": attention_masks,
            }, label)

I suspect the error might have something to do with different TensorFlow versions (I am using 2.3), but unfortunately I couldn't run the snippets in the Google Colab notebook for memory reasons.

Does anyone know what the problem with my code is? Thanks for your time and attention.

Upvotes: 0

Views: 2440

Answers (2)

Poe Dator

Reputation: 4903

Another possible cause is that truncation has to be explicitly enabled in the tokenizer. The parameter is truncation=True.
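
Applied to the encode_plus call from the question, that would look roughly like this (the max_length value is only illustrative):

bert_input = tokenizer.encode_plus(text,
                                   add_special_tokens=True,
                                   max_length=512,          # illustrative value
                                   truncation=True,         # explicitly enable truncation
                                   pad_to_max_length=True,
                                   return_attention_mask=True)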

Upvotes: 0

Lucas Azevedo

Reputation: 2370

It turns out that I had caused the trouble myself by commenting out the line

#max_length = max_length, # max length of the text that can go to BERT

I assumed it would truncate at the model's maximum size, or that it would use the longest input as the maximum size. It does neither, so even though the lists have the same number of entries, those entries vary in length, producing a non-rectangular tensor.
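
Just to illustrate what TensorFlow is complaining about, here is a minimal example (the token ids are made up):

import tensorflow as tf

# rows of different lengths -- the "non-rectangular" case from the error message
ragged = [[101, 2307, 102], [101, 102]]
# tf.data.Dataset.from_tensor_slices(ragged)  # ValueError: Can't convert non-rectangular Python sequence to Tensor

# once every row is padded/truncated to the same length, the conversion works
rectangular = [[101, 2307, 102], [101, 102, 0]]
ds = tf.data.Dataset.from_tensor_slices(rectangular)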

I've removed the # and am using 512 as max_length, which is the maximum that BERT takes anyway (see Transformers' tokenizer class for reference).
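
For completeness, the corrected helper looks roughly like this (512 is the value mentioned above; depending on your transformers version you may also need truncation=True, as the other answer points out):

def convert_example_to_feature(text):
    # tokenize, map to WordPiece ids, add special tokens, and pad/truncate
    # every review to the same fixed length so the resulting tensor is rectangular
    return tokenizer.encode_plus(text,
                                 add_special_tokens=True,    # add [CLS], [SEP]
                                 max_length=512,             # max length of the text that can go to BERT
                                 pad_to_max_length=True,     # add [PAD] tokens
                                 return_attention_mask=True) # add attention mask to not focus on pad tokens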

Upvotes: 1
