I'm trying to build a model that classifies text into 3 categories (Negative, Neutral, Positive).
I have a CSV file that contains comments on different apps together with their ratings.
First, I import all the necessary libraries:
!pip install transformers
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%tensorflow_version 2.x
import tensorflow as tf
from transformers import TFBertForSequenceClassification, BertTokenizer, DistilBertTokenizer, glue_convert_examples_to_features, InputExample, BertConfig, InputFeatures
from sklearn.model_selection import train_test_split
from tqdm import tqdm
%matplotlib inline
Then I'll download my CSV file:
!gdown --id 1S6qMioqPJjyBLpLVz4gmRTnJHnjitnuV
!gdown --id 1zdmewp7ayS4js4VtrJEHzAheSW-5NBZv
df = pd.read_csv("reviews.csv")
print(df[['content','score']].head())
content score
0 Update: After getting a response from the deve... 1
1 Used it for a fair amount of time without any ... 1
2 Your app sucks now!!!!! Used to be good but no... 1
3 It seems OK, but very basic. Recurring tasks n... 1
4 Absolutely worthless. This app runs a prohibit... 1
Converting scores to sentiment:
def to_sentiment(rating):
    rating = int(rating)
    if rating <= 2:
        return 0
    elif rating == 3:
        return 1
    else:
        return 2

df['sentiment'] = df.score.apply(to_sentiment)
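As a quick sanity check (illustrative, not part of the original code), the mapping should yield only the class ids 0, 1 and 2:

# distribution of the derived labels; should contain only the ids 0, 1 and 2
print(df['sentiment'].value_counts())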
tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case=True)
Creating helper methods to fit the data into the model:
def convert_example_to_feature(review):
    return tokenizer.encode_plus(
        review,
        add_special_tokens=True,
        max_length=160,              # truncates if len(s) > max_length
        return_token_type_ids=True,
        return_attention_mask=True,
        pad_to_max_length=True,      # pads to the right by default
    )
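For illustration (the review text here is made up), encoding a single review returns the three aligned lists that encode_examples collects below:

# example output for one review (illustrative text, not from the dataset)
features = convert_example_to_feature("Great app, but it crashes sometimes")
print(features['input_ids'][:10])       # token ids, padded/truncated to length 160
print(features['token_type_ids'][:10])  # all 0 for a single-sentence input
print(features['attention_mask'][:10])  # 1 for real tokens, 0 for padding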
def map_example_to_dict(input_ids, attention_mask, token_type_ids, label):
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "token_type_ids": token_type_ids,
    }, label
def encode_examples(ds):
    # prepare lists, so that we can build up the final TensorFlow dataset from slices
    input_ids_list = []
    token_type_ids_list = []
    attention_mask_list = []
    label_list = []
    for index, row in tqdm(ds.iterrows()):
        bert_input = convert_example_to_feature(row['content'])
        input_ids_list.append(bert_input['input_ids'])
        token_type_ids_list.append(bert_input['token_type_ids'])
        attention_mask_list.append(bert_input['attention_mask'])
        label_list.append([row['sentiment']])
    return tf.data.Dataset.from_tensor_slices(
        (input_ids_list, attention_mask_list, token_type_ids_list, label_list)
    ).map(map_example_to_dict)
df_train, df_test = train_test_split(df, test_size=0.1)
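Then I encode and batch the splits into the dataset that model.fit consumes below (a minimal sketch; the shuffle buffer of 10000 and batch size of 32 are assumptions, the original values aren't shown):

# build batched tf.data pipelines for training and evaluation
# (shuffle buffer and batch size are assumed values)
ds_train_encoded = encode_examples(df_train).shuffle(10000).batch(32)
ds_test_encoded = encode_examples(df_test).batch(32)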
Creating the model:
model = TFBertForSequenceClassification.from_pretrained('bert-base-cased')
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5, epsilon=1e-08)
loss = tf.keras.losses.SparseCategoricalCrossentropy()
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
history = model.fit(ds_train_encoded,epochs=1)
14/443 [..............................] - ETA: 3:58 - loss: nan - accuracy: 0.3438
If I change the number of sentiment classes and make it just positive and negative, then it works. But with 3 or more labels I get this problem.
Answer:
The label class indices should start from 0, not 1.
TFBertForSequenceClassification requires labels in the range [0, ..., config.num_labels - 1]. From the docs:
labels (tf.Tensor of shape (batch_size,), optional, defaults to None) – Labels for computing the sequence classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1 a regression loss is computed (Mean-Square loss), If config.num_labels > 1 a classification loss is computed (Cross-Entropy).
Source: https://huggingface.co/transformers/model_doc/bert.html#tfbertforsequenceclassification
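So with three sentiment classes, the labels 0, 1 and 2 only fall inside that range if the classification head is created with num_labels=3 (the config defaults to 2 labels). A minimal sketch of the fix, assuming the rest of the pipeline stays the same:

# create the classification head with 3 output classes so labels 0-2 are all valid
model = TFBertForSequenceClassification.from_pretrained('bert-base-cased', num_labels=3)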