Ishwar

Reputation: 84

Sentiment Analysis with an LSTM for YouTube comments using Keras

I'm aiming to get my hands dirty by slowly scaling up with LSTMs. In these initial stages, I'm trying to implement a YouTube sentiment analyzer with an LSTM in Keras. While searching for resources, I came across the IMDB sentiment analysis dataset and LSTM code. It works great for longer inputs, but shorter inputs don't do so well. The code is here: https://github.com/keras-team/keras/blob/master/examples/imdb_lstm.py

After saving the Keras model, I built a prediction module for this data with this code:

 from keras.models import load_model
 from keras.preprocessing.text import text_to_word_sequence
 from keras.preprocessing.sequence import pad_sequences
 from keras.datasets import imdb

 maxlen = 80  # must match the maxlen used during training

 model = load_model('ytsentanalysis.h5')
 print("Enter text")
 text = input()
 words = text_to_word_sequence(text)  # default filters already strip punctuation
 print(words)
 word_index = imdb.get_word_index()
 # IMDB word indices are offset by 3 when the dataset is loaded with the
 # default index_from=3 (0=pad, 1=start, 2=OOV), so shift to match.
 x_test = [[word_index[w] + 3 for w in words if w in word_index]]
 x_test = pad_sequences(x_test, maxlen=maxlen)  # pad the same way as the training data
 prediction = model.predict(x_test)
 print(prediction)

I feed in various inputs such as 'bad video', 'fantastic amazing', 'good great', or 'terrible bad'. The outputs are close to 1 for negative-themed inputs, and I've seen a prediction around 0.3 for a positive-themed input. I'd expect it to be closer to 1 for positive and closer to 0 for negative.

In an effort to solve this problem, I set maxlen=20 during both training and prediction, since YouTube comments are much shorter, and ran the same code again. This time the probabilities during prediction were all of the form e^(insert large negative power here).
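For reference, padding and truncating with `pad_sequences` works like this; a minimal check, assuming TensorFlow's bundled Keras is installed (the index values are made up):

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Two word-index sequences of different lengths, padded/truncated to maxlen=5.
# By default both padding and truncating happen at the front ('pre').
seqs = [[5, 25, 7], [3]]
out = pad_sequences(seqs, maxlen=5)
print(out)
# [[ 0  0  5 25  7]
#  [ 0  0  0  0  3]]
```

If the model was trained on 'pre'-padded sequences, predicting on unpadded or differently padded input will not line up with what the LSTM saw during training.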

Is there no way I can adapt and reuse the existing dataset? If not, since labeled YouTube comment datasets aren't as extensive, should I use something like a Twitter comment dataset, at the expense of losing the convenience of the pre-built IMDB input modules in Keras? And is there any way I can see the code for those modules?

Thank you in advance for answering all these questions.

Upvotes: 1

Views: 775

Answers (1)

Abdelrahman Ahmed

Reputation: 541

The IMDb dataset and YouTube comments are quite different, since movie reviews tend to be long and extensive compared to comments and tweets.

It may be more helpful to train a model on a publicly available dataset that is more in line with YouTube comments (e.g. tweets). You can then take the pre-trained model and fine-tune it on your YouTube comments dataset. Utilising pre-trained word embeddings such as GloVe or word2vec can be useful as well.
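A minimal sketch of the embedding idea in Keras, assuming TensorFlow is installed; the random matrix here stands in for vectors you would actually load from a GloVe or word2vec file, and all sizes are illustrative:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

vocab_size, embed_dim, maxlen = 1000, 50, 20

# Stand-in for an embedding matrix loaded from a GloVe/word2vec file:
# one row of pretrained vector values per word index.
rng = np.random.default_rng(0)
pretrained = rng.normal(size=(vocab_size, embed_dim)).astype("float32")

model = Sequential([
    Embedding(vocab_size, embed_dim,
              embeddings_initializer=tf.keras.initializers.Constant(pretrained),
              trainable=True),  # trainable=True lets fine-tuning adjust the vectors
    LSTM(32),
    Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam")

# A dummy batch of two padded word-index sequences, just to show the shapes.
preds = model.predict(np.zeros((2, maxlen), dtype="int64"), verbose=0)
print(preds.shape)  # (2, 1)
```

Setting `trainable=False` instead freezes the embeddings, which can help when the fine-tuning dataset is small.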

Alternatively, you can look into using NLTK to analyse the comments instead.
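NLTK's VADER analyser is rule-based and needs no training. As a rough illustration of the lexicon idea behind it (a toy stand-in with a made-up lexicon, not NLTK's actual implementation, which also handles negation, intensifiers, and punctuation):

```python
# Toy lexicon-based sentiment scorer, illustrating the rule-based approach
# used by tools like NLTK's VADER. The tiny lexicon here is invented;
# the real VADER lexicon has thousands of human-scored entries.
LEXICON = {
    "good": 1.0, "great": 1.5, "fantastic": 2.0, "amazing": 2.0,
    "bad": -1.0, "terrible": -2.0, "awful": -2.0,
}

def score(text: str) -> float:
    """Average the lexicon scores of known words; 0.0 if none are known."""
    words = text.lower().split()
    hits = [LEXICON[w] for w in words if w in LEXICON]
    return sum(hits) / len(hits) if hits else 0.0

print(score("fantastic amazing video"))  # 2.0 (positive)
print(score("terrible bad"))             # -1.5 (negative)
```

Short inputs like YouTube comments suit this approach well, since there is no trained model expecting long, padded sequences.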

Upvotes: 1
