FredBones

Reputation: 1415

Image Cropping - Region of Interest Query

I have a set of videos of someone talking. I'm building a lip-recognition system, so I need to perform some image processing on a specific region of each frame (the lower chin and lips).

I have over 200 videos, each containing a sentence. It is natural conversation, so the head constantly moves and the lips aren't in a fixed place. I'm having difficulty specifying my region of interest: it is very tiresome to watch through every video and work out how big my box needs to be to ensure the lips stay within the ROI.

Is there an easier way to check this, perhaps using MATLAB? I was thinking I could crop the video frame by frame, output an image for each frame, and then manually go through the images to see whether the lips ever go out of frame.
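The dump-and-skim idea above can be sketched in a few lines of Python with OpenCV instead of MATLAB (a sketch only; the video path, box coordinates, and output directory below are placeholder assumptions, not values from the question):

```python
import os

def crop_roi(frame, box):
    """Return the sub-image of frame for box = (x, y, w, h)."""
    x, y, w, h = box
    return frame[y:y + h, x:x + w]

def dump_crops(video_path, box, out_dir):
    """Write the cropped ROI of every frame to out_dir as JPEGs,
    so the crops can be skimmed quickly as thumbnails.
    Returns the number of frames written."""
    import cv2  # pip install opencv-python
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    i = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imwrite(os.path.join(out_dir, "frame_%05d.jpg" % i),
                    crop_roi(frame, box))
        i += 1
    cap.release()
    return i

# Example (hypothetical paths/coordinates):
# dump_crops("video001.avi", (120, 200, 160, 100), "crops/video001")
```

Skimming a folder of thumbnails in a file browser is much faster than watching each video, though it is still a manual check per video.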

Upvotes: 3

Views: 1546

Answers (1)

ely

Reputation: 77504

I had to solve a similar problem tracking the heads and limbs of students participating in class discussions on video. We experimented with state-of-the-art optical flow tracking from Thomas Brox (link; see the part about large-displacement optical flow). In our case we had nearly 20 terabytes of video to work through, so we had no choice but to use a C++/GPU implementation of the optical flow code; I think you too will discover that Matlab is impossibly slow for video analysis.

Optical flow gives you detailed per-pixel motion vectors. If you mark the bounding box for the mouth and chin in just the first frame of each video, you can then follow the tracks given by the optical flow of those pixels, which will usually give you a good sequence of bounding boxes. You will probably still have errors to clean up, though; you could write a Python script that plays back the sequence of bounding boxes so you can quickly check for them.
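The box-propagation step can be sketched separately from the flow computation itself. Assuming a dense flow field laid out as an (H, W, 2) array of per-pixel (dx, dy) displacements (the convention used by, e.g., OpenCV's `calcOpticalFlowFarneback`), one simple update rule is to shift the box by the median motion vector inside it — a sketch, not the tracker from my linked code:

```python
import numpy as np

def propagate_box(box, flow):
    """Shift box = (x, y, w, h) by the median flow vector inside it.

    flow is a dense (H, W, 2) array of per-pixel (dx, dy) displacements
    between two consecutive frames.  The median is used rather than the
    mean because it is more robust to outlier vectors (e.g. background
    pixels caught inside the box).
    """
    x, y, w, h = box
    vectors = flow[y:y + h, x:x + w].reshape(-1, 2)
    dx = float(np.median(vectors[:, 0]))
    dy = float(np.median(vectors[:, 1]))
    return (int(round(x + dx)), int(round(y + dy)), w, h)
```

You would seed this with the hand-marked box from frame one and apply it frame pair by frame pair; drift accumulates over long sequences, which is exactly why a playback-and-inspect pass afterwards is worth having.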

The code I wrote for this is in Python, and it probably won't be easy to adapt to your data set-up or your problem, but you can find my affine-transformation-based optical flow tracking code linked here, in the part called 'Object tracker using dense optical flow.'

The short answer is that this is a very difficult and annoying problem for vision researchers. Most people "solve" it by placing their videos, frame by frame, onto Mechanical Turk and paying human workers about 2 cents per frame analyzed. This gives you pretty good results (you'll still have to clean them up after collecting them from the Mechanical Turkers), but it's not very helpful when you have tons of videos and cannot wait for enough of them to be analyzed on Mechanical Turk.

There definitely isn't an out-of-the-box solution to region-of-interest annotation, though. You'd probably have to pay quite a lot for third-party software that does this automatically; my best guess would be to check what face.com would charge you and how well it performs. Be careful that you don't violate any researcher confidentiality agreements with your data set, though, for this or for Mechanical Turk.

Upvotes: 1
