Extract Best Image from Cluster of Webpages

Question

I've written some Java code that uses Crawler4J to crawl a set of webpages and then uses K-Means to cluster them by keywords. I want to select the best image from each cluster, where "best" is loosely defined as "best represents the topics in the cluster". Before I roll my own, I'm wondering whether any existing frameworks do this, since it's a problem that many people must already have solved when displaying aggregated news, etc.

Most of the pages that I'm crawling are standard news pages about a given topic, so the best image for a page is usually both 1) the biggest image and 2) the image immediately preceding the biggest block of text. If I have to roll my own implementation, my tentative plan is to grab the best image from each page in the cluster based on those (and other) heuristics, and then pick an image for the cluster based on both the quality of each image (size, link text, name, position in document) and the quality of the page that it came from.
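To make that plan concrete, here is a minimal sketch of the two-stage selection, under some loud assumptions: `ImageCandidate`, `Page`, `POSITION_BONUS`, and the `quality` field are hypothetical names that I made up for illustration (none of them come from Crawler4J), only image area and position are scored, and the page quality is assumed to be a number in [0, 1] computed elsewhere (for example, how close the page is to its cluster centroid).

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class ClusterImagePicker {

    /** Hypothetical holder for one image found on a crawled page. */
    static final class ImageCandidate {
        final String url;
        final int width;
        final int height;
        final boolean precedesLargestTextBlock;

        ImageCandidate(String url, int width, int height, boolean precedesLargestTextBlock) {
            this.url = url;
            this.width = width;
            this.height = height;
            this.precedesLargestTextBlock = precedesLargestTextBlock;
        }
    }

    /** Hypothetical holder for a crawled page; quality is assumed to lie in [0, 1]. */
    static final class Page {
        final double quality;
        final List<ImageCandidate> images;

        Page(double quality, List<ImageCandidate> images) {
            this.quality = quality;
            this.images = images;
        }
    }

    // Assumed weight: an image right before the main text counts 1.5x. Tune as needed.
    static final double POSITION_BONUS = 1.5;

    /** Heuristic image score: pixel area, boosted if the image precedes the largest text block. */
    static double imageScore(ImageCandidate img) {
        double area = (double) img.width * img.height;
        return img.precedesLargestTextBlock ? area * POSITION_BONUS : area;
    }

    /** Stage 1: the highest-scoring image on a single page, if the page has any images. */
    static Optional<ImageCandidate> bestImageOnPage(Page page) {
        return page.images.stream()
                .max(Comparator.comparingDouble(ClusterImagePicker::imageScore));
    }

    /** Stage 2: weight each page's best image by the page quality and keep the overall winner. */
    static Optional<ImageCandidate> bestImageForCluster(List<Page> cluster) {
        ImageCandidate best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Page page : cluster) {
            Optional<ImageCandidate> candidate = bestImageOnPage(page);
            if (!candidate.isPresent()) {
                continue; // page has no images
            }
            double score = imageScore(candidate.get()) * page.quality;
            if (score > bestScore) {
                bestScore = score;
                best = candidate.get();
            }
        }
        return Optional.ofNullable(best);
    }
}
```

The other signals mentioned above (link text, file name) would presumably become extra terms in `imageScore`; multiplying by page quality is just one possible way to combine the two levels, not necessarily the right one.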

To summarize, my question is twofold: are there any existing open-source frameworks (preferably implemented in Java) that could help with this task, and is there a better approach than the one I'm proposing? Thanks!
