Reputation: 84
I'm aiming to get my hands dirty with LSTMs by starting small and scaling up. As a first step, I'm trying to build a YouTube comment sentiment analyzer using an LSTM in Keras. While searching for resources, I came across the IMDB sentiment analysis dataset and LSTM example code. It works well for longer inputs, but shorter inputs don't do so well. The code is here: https://github.com/keras-team/keras/blob/master/examples/imdb_lstm.py
I saved the Keras model and built a prediction module for this data with the following code:
from keras.models import load_model
from keras.preprocessing.text import text_to_word_sequence
from keras.preprocessing.sequence import pad_sequences
from keras.datasets import imdb

model = load_model('ytsentanalysis.h5')
print("Enter text")
text = input()
# tokenize the comment the same way Keras preprocesses text
words = text_to_word_sequence(text, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=" ")
print(words)
word_index = imdb.get_word_index()
# map each word to its IMDB vocabulary index, skipping unknown words
x_test = [[word_index[w] for w in words if w in word_index]]
x_test = pad_sequences(x_test, maxlen=80)  # must match the maxlen used in training
prediction = model.predict(x_test)
print(prediction)
I feed in various inputs such as 'bad video', 'fantastic amazing', 'good great', and 'terrible bad'. The outputs are close to 1 for negatively themed inputs, and I've seen a prediction around 0.3 for a positively themed input. I'd expect values closer to 1 for positive and closer to 0 for negative.
In an effort to solve this, I limited maxlen=20 during both training and prediction, since YouTube comments are much shorter, and ran the same code again. This time every predicted probability was e^insert large negative power here.
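For reference, the only change from the example's preprocessing was the maxlen value (a sketch, with variable names following imdb_lstm.py):

from keras.datasets import imdb
from keras.preprocessing.sequence import pad_sequences

maxlen = 20  # YouTube comments are much shorter than IMDB reviews
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=20000)
x_train = pad_sequences(x_train, maxlen=maxlen)
x_test = pad_sequences(x_test, maxlen=maxlen)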
Is there any way I can adapt and reuse the existing dataset? If not, since labeled YouTube comment datasets aren't as extensive, should I use something like a Twitter dataset, at the expense of losing the convenience of the pre-built IMDB input modules in Keras? And is there any way I can see the code for those modules?
Thank you in advance for answering all these questions.
Upvotes: 1
Views: 775
Reputation: 541
The IMDb dataset and YouTube comments are quite different: movie reviews are long and extensive compared to comments and tweets.
It may be more helpful to train a model on a publicly available dataset that is more in line with YouTube comments (e.g. tweets). You can then take that pre-trained model and fine-tune it on your YouTube comments dataset. Utilising pre-trained word embeddings such as GloVe or word2vec can be useful as well, as sketched below.
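A minimal sketch of wiring pre-trained GloVe vectors into a Keras Embedding layer, assuming a downloaded glove.6B.100d.txt file plus a word_index and maxlen from your own preprocessing (names are illustrative):

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

embedding_dim = 100
# load GloVe vectors into a {word: vector} map
embeddings = {}
with open('glove.6B.100d.txt', encoding='utf8') as f:
    for line in f:
        values = line.split()
        embeddings[values[0]] = np.asarray(values[1:], dtype='float32')

# build an embedding matrix aligned with your own word_index
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
for word, i in word_index.items():
    if word in embeddings:
        embedding_matrix[i] = embeddings[word]

model = Sequential()
model.add(Embedding(len(word_index) + 1, embedding_dim,
                    weights=[embedding_matrix], input_length=maxlen,
                    trainable=False))  # freeze first, unfreeze later to fine-tune
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])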
Alternatively, you can look into using NLTK to analyse the comments instead.
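For example, NLTK's VADER sentiment analyser is lexicon-based, tuned for short social-media text, and needs no training data at all; a quick sketch:

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # one-time download of the VADER lexicon
sia = SentimentIntensityAnalyzer()

for comment in ['bad video', 'fantastic amazing', 'good great', 'terrible bad']:
    scores = sia.polarity_scores(comment)
    # compound is in [-1, 1]: negative values indicate negative sentiment
    print(comment, scores['compound'])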
Upvotes: 1