Deep learning - use both images and their description

Question

I am going to build a classifier that categorizes images. I know that I should use a convolutional neural network for this. The thing is that I also have a description for every image. Is there any way I can use this description to improve the classifier?

Prophecies · Accepted Answer

The easiest thing to do is to use both image features (from a CNN) and text features (from an LSTM language model, a bag-of-words representation, or an off-the-shelf encoder such as skip-thought vectors), combine them, and train the network to predict the image class in the usual way. The two feature vectors can be combined by concatenation, element-wise multiplication, element-wise sum, or outer product. Take a look at recent progress in visual question answering (VQA); what you're describing sounds like a subset of what can be done with VQA.
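
To make the four combination options concrete, here is a minimal sketch in plain Python. The feature values and the function names (`fuse_concat`, `fuse_product`, `fuse_sum`, `fuse_outer`) are made up for illustration; in a real model the vectors would be the outputs of your CNN and text encoder, and the same operations would be done on tensors in your deep learning framework.

```python
# Hypothetical feature vectors. In practice, image_features would come from
# a CNN and text_features from a text encoder (LSTM, bag-of-words, etc.).
image_features = [0.5, 2.0, -1.0]
text_features = [1.0, 0.0, 3.0]


def _check_same_length(img, txt):
    # Element-wise operations only make sense for equal-sized vectors.
    if len(img) != len(txt):
        raise ValueError("element-wise fusion needs vectors of equal length")


def fuse_concat(img, txt):
    """Concatenation: works for any pair of sizes, output has len(img) + len(txt)."""
    return list(img) + list(txt)


def fuse_product(img, txt):
    """Element-wise multiplication: output has the same size as the inputs."""
    _check_same_length(img, txt)
    return [a * b for a, b in zip(img, txt)]


def fuse_sum(img, txt):
    """Element-wise sum: output has the same size as the inputs."""
    _check_same_length(img, txt)
    return [a + b for a, b in zip(img, txt)]


def fuse_outer(img, txt):
    """Outer product, flattened row by row: output has len(img) * len(txt) values."""
    return [a * b for a in img for b in txt]


fused = {
    "concat": fuse_concat(image_features, text_features),
    "product": fuse_product(image_features, text_features),
    "sum": fuse_sum(image_features, text_features),
    "outer": fuse_outer(image_features, text_features),
}

for name, vector in fused.items():
    print(name, len(vector))
```

Whichever fused vector you pick is then fed to the classification layers. Concatenation is usually the simplest starting point because it does not require the two feature vectors to have the same size, while the outer product grows quickly with the feature dimensions.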
