Reputation: 433
I am trying to learn BERT for text classification, but I am having trouble preparing the data for it.
From my dataset, I separate the sentiments and reviews as:
X = df['sentiments']
y = df['reviews']  # contains four different classes of reviews
Next,
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
train_encodings = tokenizer(X_train, truncation=True, padding=True, max_length=512)
This is where I get the error:
ValueError Traceback (most recent call last)
<ipython-input-70-22714fcf7991> in <module>()
----> 1 train_encodings = tokenizer(X_train, truncation=True, padding=True, max_length=max_length)
2 #valid_encodings = tokenizer(valid_texts, truncation=True, padding=True, max_length=max_length)
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in __call__(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
2261 if not _is_valid_text_input(text):
2262 raise ValueError(
-> 2263 "text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) "
2264 "or `List[List[str]]` (batch of pretokenized examples)."
2265 )
ValueError: text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).
When I try to convert X to a list and use it, I get a different error:
TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]
Can someone please explain where the problem is? Previously I followed a tutorial on the 20 Newsgroups dataset and it worked. But now that I am using it for another project, it doesn't work and I feel sad.
Thanks.
Upvotes: 4
Views: 7493
Reputation: 7379
The error comes from your X = df['sentiments']
and y = df['reviews']
lines: X and y are still pandas Series (DataFrame columns), not lists, and the tokenizer only accepts a str, a List[str], or a List[List[str]]. A simple way to change them is:
X = df['sentiments'].values
and y = df['reviews'].values
which returns a NumPy array. If the tokenizer does not accept that either, convert it further to a Python list using
X = df['sentiments'].values.tolist()
and y = df['reviews'].values.tolist()
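As a minimal sketch of the conversion, the snippet below builds a small stand-in DataFrame (the rows are made up for illustration; only the column names come from the question) and turns the text column into a plain list of strings. The TypeError mentioned in the question often appears when the list still contains non-string entries such as NaN or None, so the sketch assumes that dropping those rows is acceptable for your data.

```python
import pandas as pd

# Hypothetical stand-in for the question's df; only the column names are from the question.
df = pd.DataFrame({
    "sentiments": ["great product", "terrible service", None, "it was okay"],
    "reviews": [3, 0, 1, 2],
})

# Drop rows whose text is missing, then force every remaining entry to str.
df = df.dropna(subset=["sentiments"])
texts = df["sentiments"].astype(str).tolist()
labels = df["reviews"].tolist()

# texts is now a plain list of str, which matches the List[str] input the error message asks for.
assert isinstance(texts, list) and all(isinstance(t, str) for t in texts)

# With a tokenizer already loaded as in the question, the call would then be:
# train_encodings = tokenizer(texts, truncation=True, padding=True, max_length=512)
```

If you split with train_test_split afterwards, it is worth checking type(X_train) once more before calling the tokenizer, since that type check is what the ValueError is complaining about.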
Upvotes: 7