How to prepare text for BERT - getting error

Question

I am trying to learn BERT for text classification, and I am having trouble preparing the data for it.

From my dataset, I separate the sentiments and reviews as:

X = df['sentiments']
y = df['reviews']  # contains four different classes of reviews

Next,

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
train_encodings = tokenizer(X_train, truncation=True, padding=True, max_length=512)

This is where I get the error:

ValueError                                Traceback (most recent call last)
 in ()
----> 1 train_encodings = tokenizer(X_train, truncation=True, padding=True, max_length=max_length)
      2 #valid_encodings = tokenizer(valid_texts, truncation=True, padding=True, max_length=max_length)

/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in __call__(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
   2261         if not _is_valid_text_input(text):
   2262             raise ValueError(
-> 2263                 "text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) "
   2264                 "or `List[List[str]]` (batch of pretokenized examples)."
   2265             )

ValueError: text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).

When I convert X to a list and use that instead, I get another error:

TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]

Can someone please explain where the problem is? Previously I followed a tutorial on the 20 news dataset and it worked. But now that I am using the same approach for another project, it doesn't work, and I feel sad.

Thanks.

Ashwin Geet D'Sa · Accepted Answer

The error comes from the lines X = df['sentiments'] and y = df['reviews']: X and y are still DataFrame columns (pandas Series), not lists, and the tokenizer accepts only the types named in the error message. A simple way to change them is:

X = df['sentiments'].values
y = df['reviews'].values

This returns NumPy arrays. If the tokenizer still rejects the input, they can be converted further to Python lists using:

X = df['sentiments'].values.tolist()
y = df['reviews'].values.tolist()
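
As a rough sketch of the conversion, the snippet below builds a toy DataFrame (made-up data, with the column names taken from the question) and turns both columns into plain Python lists. It also drops missing rows and casts every text to str. That last step is an assumption about the data rather than something the question confirms: the second error (TextEncodeInput must be ...) often shows up when a list contains entries that are not strings, such as NaN or numbers.

```python
import pandas as pd

# Toy stand-in for the real DataFrame; the values are invented for illustration.
df = pd.DataFrame({
    "sentiments": ["good film", None, "too long", 3.5],
    "reviews": ["a", "b", "c", "d"],
})

# Drop rows with a missing text or label so X and y stay aligned.
clean = df.dropna(subset=["sentiments", "reviews"])

# Cast every text to str and convert both columns to plain Python lists,
# because the tokenizer expects str, List[str] or List[List[str]].
X = clean["sentiments"].astype(str).tolist()
y = clean["reviews"].tolist()

print(type(X).__name__, len(X), len(y))

# With a tokenizer already loaded (not shown here), the call from the
# question would then take a list instead of a Series:
# train_encodings = tokenizer(X_train, truncation=True, padding=True, max_length=512)
```

If the split is done on these lists, train_test_split returns lists as well, so X_train can go straight to the tokenizer. If the split was already done on the Series, calling X_train.tolist() right before tokenizing should have the same effect.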
