Nguyen Hoang Chu

Reputation: 1

Does fine-tuning a BERT model multiple times on different datasets make it more accurate?

I'm totally new to NLP and BERT models. What I'm trying to do right now is sentiment analysis on Twitter trending hashtags ("neg", "neu", "pos") using a DistilBERT model, but the accuracy was only about 50% (I tried it with labelled data taken from Kaggle). So here is my idea: (1) first, I will fine-tune a DistilBERT model (Model 1) on the IMDB dataset; (2) then, since I've collected some data from Twitter posts, I will run sentiment analysis on them with Model 1 and get Result 2; (3) finally, I will fine-tune Model 1 again on Result 2, expecting to get Model 3.

I'm not really sure whether this process actually does anything to make the model more accurate. Thanks for reading my post.
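
To make the idea concrete, here is a rough sketch of the three steps using the Hugging Face pipeline API; the checkpoint name and example tweets are just placeholders, not anything I actually have:

from transformers import pipeline

# Step 1: Model 1 = DistilBERT fine-tuned on the IMDB dataset
#         ("my-distilbert-imdb" is a placeholder checkpoint name)
model_1 = pipeline("sentiment-analysis", model="my-distilbert-imdb")

# Step 2: label the unlabelled Twitter data with Model 1 -> Result 2
tweets = ["first example tweet about the trending hashtag",
          "second example tweet about the trending hashtag"]
result_2 = [(tweet, model_1(tweet)[0]["label"]) for tweet in tweets]

# Step 3: fine-tune Model 1 again on the (tweet, predicted label) pairs in
# result_2 - this self-labelling step is the one I'm unsure about.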

Upvotes: 0

Views: 1631

Answers (2)

Nick's Pizza

Reputation: 56

If you want to fine-tune a sentiment classification head of BERT for classifying tweets, then I'd recommend a different strategy:

  1. The IMDB dataset represents a different kind of sentiment - movie-review ratings do not really correspond to short-post sentiment, unless you want to focus on tweets about movies.

  2. Using the classifier's output as training input for that same classifier is not really a good approach: if the classifier made many mistakes while classifying, those mistakes will be reflected in the training data, and the errors will deepen. This basically creates endogenous labels, which will not really improve your real-world classification.

  3. You should consider other ways of obtaining labelled training data. There are a few good options for Twitter:

  • Twitter datasets on Kaggle - there are plenty of datasets available containing millions of tweets. Some of them even contain sentiment labels (usually inferred from emoticons, which have been shown to be more accurate than words at predicting sentiment - for an explanation see e.g. Frasincar 2013). So that's probably where you should look.

  • Stocktwits (if you're interested in financial sentiment) - contains posts that authors can label for sentiment, so it's a great way of mining labelled data if stocks/crypto are what you're looking for.

Another thing is picking a model that's better suited to your kind of text; I'd recommend this one. It has been pretrained on 80M tweets, so it should provide strong improvements. I believe it even comes with a sentiment classification head that you can use.

Roberta Twitter Base

Check out the model's page for guidance on loading it in your code - it's very easy; just use the following code (this is for sentiment classification):

MODEL = "cardiffnlp/twitter-roberta-base-sentiment"


tokenizer = AutoTokenizer.from_pretrained(MODEL)

model = AutoModelForSequenceClassification.from_pretrained(MODEL)
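
Once the model is loaded, scoring a tweet is just a forward pass. Below is a minimal inference sketch, assuming PyTorch and the three-class label order (0 = negative, 1 = neutral, 2 = positive) given on the model card; the example tweet is made up:

import torch

# Label order assumed from the model card: 0=negative, 1=neutral, 2=positive
labels = ["negative", "neutral", "positive"]

tweet = "I love this new phone!"  # made-up example
inputs = tokenizer(tweet, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)[0]
print({label: round(prob.item(), 3) for label, prob in zip(labels, probs)})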

Another benefit of this model is that it has been pretrained from scratch with a vocabulary that contains emojis, meaning it has a deep understanding of them, their typical contexts and co-occurrences. This can greatly benefit social media classification, as many researchers would agree that emojis/emoticons are better predictors of sentiment than normal words.

Upvotes: 1

Mohsen

Reputation: 31

I'm a little skeptical about your first step. Since the IMDB dataset is different from your target data, I do not think it will positively affect the outcome of your work. I would therefore suggest fine-tuning on a dataset of tweets or other social media hashtags; however, if you are only focusing on hashtags and do not care about the text, it might work! My limited experience with fine-tuning transformers like BART and BERT is that the dataset you fine-tune on should be very similar to your actual data. In general, though, you can fine-tune a model on several datasets, and if those datasets are all structured for one goal, that can improve the model's accuracy.
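
To illustrate, here is a minimal fine-tuning sketch using the Hugging Face Trainer; the CSV file, column names and hyperparameters are placeholders, and it assumes a labelled tweet dataset with three sentiment classes:

import pandas as pd
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Placeholder file/column names - substitute your own labelled tweet dataset
df = pd.read_csv("labelled_tweets.csv")  # columns: "text", "label" (0/1/2)
dataset = Dataset.from_pandas(df).train_test_split(test_size=0.1)

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(output_dir="distilbert-tweets",
                         num_train_epochs=3,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"],
                  eval_dataset=dataset["test"])
trainer.train()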

Upvotes: 1
