Reasons for using Bag of Words in Computer Vision

Question

Why would one choose a Bag of Words approach in computer vision?

For example: if one uses HOG features as the descriptor and applies a BoW approach to these features, the result is a histogram of histograms.

I can see the advantages of dimensionality reduction in this approach, and also the fixed size of the resulting histogram, but are these really the only reasons? The reduction also causes a loss of information.

Alternatively, one could just resize the images to a fixed, usually smaller, size and compute the HOG descriptor on the result. The resulting vector would also have a fixed size, so it could be used with a classifier as well. This would also cause a loss of information, especially when the fixed image size is very small, but the loss would not be as drastic as with k-means.

Niki · Accepted Answer

I think the idea is something like this: The low-level feature detector finds small "relevant" patches, and the descriptor + k-means algorithm packs them into bags like "a headlight", "a tire", "a car roof". Then, if you find a pair of headlights, two tires and a car roof, you're probably looking at a car.
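As a rough illustration of the "descriptor + k-means" step, here is a minimal sketch using only NumPy. The random arrays stand in for real local descriptors (e.g. HOG computed on patches), and the names `kmeans`, `bow_histogram`, the 36-dimensional descriptor size and the vocabulary size of 8 are assumptions made for the example, not anything prescribed by a particular library.

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns a (k, d) array of cluster centres."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assign every descriptor to its nearest centre.
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centre if a cluster goes empty
                centres[j] = members.mean(axis=0)
    return centres

def bow_histogram(descriptors, centres):
    """Map each descriptor to its nearest visual word and count the words."""
    dists = np.linalg.norm(descriptors[:, None, :] - centres[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(centres)).astype(float)
    return hist / hist.sum()  # normalise so images with more patches compare fairly

rng = np.random.default_rng(0)

# Stand-in for local descriptors pooled over all training images (assumed 36-D).
train_descriptors = rng.normal(size=(500, 36))
vocabulary = kmeans(train_descriptors, k=8)

# Two "images" with different numbers of detected patches.
image_a = rng.normal(size=(40, 36))
image_b = rng.normal(size=(125, 36))

hist_a = bow_histogram(image_a, vocabulary)
hist_b = bow_histogram(image_b, vocabulary)
print(hist_a.shape, hist_b.shape)
```

Both images end up as vectors of length 8, however many patches were detected in each, which is the fixed-size property mentioned in the question.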

The advantage would be that it doesn't matter where the tires and the headlights are, so it doesn't matter if you're looking at a side view or a front view or a different model of car. If you apply a feature descriptor directly to the whole image, a side view and a front view would get completely different descriptions.
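The position argument can be checked with a toy example. The sketch below assumes a hypothetical three-word vocabulary in a made-up 2-D descriptor space, and models a change of viewpoint simply as the same patches being found in a different order. That is a strong simplification, since a real viewpoint change also alters how the patches look.

```python
import numpy as np

# Hypothetical vocabulary: three visual words in a 2-D descriptor space.
vocabulary = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])

def bow_histogram(descriptors, vocabulary):
    """Count how many descriptors fall closest to each visual word."""
    dists = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    return np.bincount(words, minlength=len(vocabulary))

# Patch descriptors in the order they are encountered in one view.
view_1 = np.array([[0.2, -0.1], [9.8, 0.3], [10.1, -0.2], [0.1, 9.7]])
# The same patches encountered at other locations, i.e. in another order.
view_2 = view_1[[3, 1, 0, 2]]

h1 = bow_histogram(view_1, vocabulary)
h2 = bow_histogram(view_2, vocabulary)

same_bow = np.array_equal(h1, h2)                             # order is ignored
same_concat = np.array_equal(view_1.ravel(), view_2.ravel())  # order matters

print(h1, h2, same_bow, same_concat)
```

The two histograms are identical, while the concatenated descriptor vectors differ. A descriptor computed on a fixed grid over the whole image behaves like the concatenation: moving a part to another cell changes the vector.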
