Reputation: 33
I am going to build a classifier that categorizes images. I know that I should use a convolutional neural network for this. The thing is that I have a description for every image. Is there any way I can use this description to improve the classifier?
Upvotes: 3
Views: 755
Reputation: 723
The easiest thing to do is to use both image features (from a CNN) and text features (from an LSTM language model, a bag-of-words representation, or an off-the-shelf encoder such as skip-thought vectors) and train the network to predict the image class in the usual way. The two feature vectors can be combined by concatenation, element-wise multiplication, element-wise sum, or outer product. Also take a look at recent progress in visual question answering (VQA); what you're describing sounds like a subset of what can be done with VQA.
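To make the four combination options concrete, here is a minimal sketch in plain Python. The function names (`fuse_concat`, `fuse_product`, `fuse_sum`, `fuse_outer`) and the toy vectors are made up for illustration; in a real model the inputs would be the CNN and text-encoder outputs, and you would use your framework's tensor operations instead of lists.

```python
from typing import List

Vector = List[float]


def _check_same_length(img: Vector, txt: Vector) -> None:
    # Element-wise fusion only makes sense for equally sized vectors.
    if len(img) != len(txt):
        raise ValueError("image and text features must have the same length")


def fuse_concat(img: Vector, txt: Vector) -> Vector:
    # Concatenation: output length is len(img) + len(txt).
    return list(img) + list(txt)


def fuse_product(img: Vector, txt: Vector) -> Vector:
    # Element-wise multiplication: output length equals the input length.
    _check_same_length(img, txt)
    return [a * b for a, b in zip(img, txt)]


def fuse_sum(img: Vector, txt: Vector) -> Vector:
    # Element-wise sum: output length equals the input length.
    _check_same_length(img, txt)
    return [a + b for a, b in zip(img, txt)]


def fuse_outer(img: Vector, txt: Vector) -> Vector:
    # Outer product, flattened row by row: output length is len(img) * len(txt).
    return [a * b for a in img for b in txt]


if __name__ == "__main__":
    image_features = [1.0, 2.0]  # hypothetical CNN output
    text_features = [3.0, 4.0]   # hypothetical text-encoder output
    print(fuse_concat(image_features, text_features))
    print(fuse_product(image_features, text_features))
    print(fuse_sum(image_features, text_features))
    print(fuse_outer(image_features, text_features))
```

Whichever fusion you pick, the fused vector is what you feed into the final classification layer. Note that concatenation and the outer product work with feature vectors of different sizes, while the element-wise variants need both vectors to have the same length.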
Upvotes: 1
Reputation: 7148
Sure. Neural networks have been applied to text, for example in https://arxiv.org/pdf/1609.08144v2.pdf. You only want to output classes, not sentences, so your task is easier than theirs. To combine the image classifier and the text classifier, you could use a weighted rank sum on their outputs.
How much the classifier improves sounds very interesting to me and could be the basis for a publication.
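One possible reading of the weighted rank sum, as a rough sketch: each classifier ranks the classes by its own score (rank 1 is best), the ranks are combined with per-classifier weights, and the class with the lowest combined value wins. The function names, class labels, scores, and weights below are all hypothetical; the weights would normally be tuned on a validation set.

```python
from typing import Dict


def ranks(scores: Dict[str, float]) -> Dict[str, int]:
    # Rank 1 goes to the highest score. Ties keep dictionary order,
    # because Python's sort is stable.
    ordered = sorted(scores, key=lambda label: scores[label], reverse=True)
    return {label: position for position, label in enumerate(ordered, start=1)}


def weighted_rank_sum(
    image_scores: Dict[str, float],
    text_scores: Dict[str, float],
    image_weight: float = 0.5,
    text_weight: float = 0.5,
) -> Dict[str, float]:
    # Both classifiers must score the same set of classes.
    if set(image_scores) != set(text_scores):
        raise ValueError("both classifiers must score the same classes")
    image_ranks = ranks(image_scores)
    text_ranks = ranks(text_scores)
    return {
        label: image_weight * image_ranks[label] + text_weight * text_ranks[label]
        for label in image_scores
    }


def predict(
    image_scores: Dict[str, float],
    text_scores: Dict[str, float],
    image_weight: float = 0.5,
    text_weight: float = 0.5,
) -> str:
    # The lowest weighted rank sum is the combined prediction.
    combined = weighted_rank_sum(image_scores, text_scores, image_weight, text_weight)
    return min(combined, key=lambda label: combined[label])


if __name__ == "__main__":
    # Hypothetical per-class scores from the two classifiers.
    image_scores = {"cat": 0.6, "dog": 0.3, "bird": 0.1}
    text_scores = {"cat": 0.2, "dog": 0.7, "bird": 0.1}
    print(predict(image_scores, text_scores, image_weight=0.3, text_weight=0.7))
```

Because this works on ranks rather than raw scores, the two classifiers do not need calibrated or comparable output scales, which is handy when one is a CNN and the other is a separate text model.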
Upvotes: 0