Reputation: 1715
I've written some Java code that uses Crawler4J to crawl a bunch of webpages and then uses K-Means to cluster them by keywords. I want to select the best image from each cluster (where "best" is loosely defined as "best represents the topics in the cluster"). Before I roll my own, I'm wondering whether there are any existing frameworks that do this, since it's a problem that a lot of people must already have solved when displaying aggregated news, etc.
Most of the pages that I'm crawling are standard news pages about a given topic, so the best image for a page is usually both 1) the biggest image and 2) the image immediately preceding the biggest block of text. If I have to roll my own implementation, my tentative plan is to grab the best image from each page in the cluster based on those (and other) heuristics, and then pick an image for the cluster based on both the quality of each image (size, link text, name, position in document) and the quality of the page that it came from.
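To make the per-page heuristic concrete, here is a minimal sketch of how the two signals (image size and position before the main text block) might be combined. The class `ImageScorer`, its `Candidate` fields, and the `POSITION_BONUS` weight are all hypothetical names and values chosen for illustration; they are not part of Crawler4J or any other library.

```java
// Hypothetical sketch: names, fields, and the weight below are assumptions.
final class ImageScorer {

    /** Minimal description of an image found on a crawled page. */
    static final class Candidate {
        final String url;
        final int width;
        final int height;
        final boolean precedesLargestTextBlock;

        Candidate(String url, int width, int height, boolean precedesLargestTextBlock) {
            this.url = url;
            this.width = width;
            this.height = height;
            this.precedesLargestTextBlock = precedesLargestTextBlock;
        }
    }

    // Assumed multiplier for an image sitting right before the biggest text block.
    static final double POSITION_BONUS = 1.5;

    /** Scores a candidate by pixel area, boosted if it precedes the main text. */
    static double score(Candidate c) {
        double area = (double) c.width * c.height;
        return c.precedesLargestTextBlock ? area * POSITION_BONUS : area;
    }

    /** Returns the highest-scoring candidate, or null if the page has no images. */
    static Candidate best(Candidate[] candidates) {
        Candidate best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Candidate c : candidates) {
            double s = score(c);
            if (s > bestScore) {
                bestScore = s;
                best = c;
            }
        }
        return best;
    }
}
```

Other signals mentioned above (link text, file name) could be folded into `score` as further terms; the weighting would need tuning against real pages.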
To summarize, my question is twofold: are there any existing open source frameworks (preferably implemented in Java) that may be able to help with my task, and is there a better approach than the one that I'm proposing? Thanks!
Upvotes: 0
Views: 83
Reputation: 77474
How about choosing the image from the most central item? Since k-means partitions the data around centroids, you can treat the instance closest to each centroid as the best representative of that cluster. (If you used this rule within the clustering itself, you would get k-medoids.)
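As a rough sketch of that idea, assuming each page is already a numeric keyword vector and you have the centroid for its cluster, you could pick the representative like this. The class and method names are made up for illustration, and Euclidean distance is assumed because that is what standard k-means minimizes.

```java
// Hypothetical helper: finds the cluster member nearest to the centroid.
final class ClusterRepresentative {

    /** Squared Euclidean distance between two vectors of equal length. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /**
     * Returns the index of the member closest to the centroid,
     * or -1 if the cluster is empty.
     */
    static int closestToCentroid(double[][] members, double[] centroid) {
        int bestIndex = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < members.length; i++) {
            double d = squaredDistance(members[i], centroid);
            if (d < bestDistance) {
                bestDistance = d;
                bestIndex = i;
            }
        }
        return bestIndex;
    }
}
```

The page at the returned index is then the one whose best image you would use for the cluster.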
Since k-means can degenerate badly, you may want to check that the elements of each cluster are nearer to their cluster center than the distance between two cluster centers. If the cluster centers are closer to each other than your data points are to their own centers, your k-means result has degenerated.
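One possible way to code that sanity check is sketched below. It is only one reading of the rule: it compares the mean member-to-centroid distance of each cluster with the distance to the nearest other centroid. Using the mean (rather than, say, the maximum) is an assumption, and the class and method names are hypothetical.

```java
// Hypothetical check: flags a clustering whose centroids sit closer together
// than the members sit to their own centroid.
final class DegeneracyCheck {

    /** Euclidean distance between two vectors of equal length. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /**
     * clusters[i] holds the members assigned to centroids[i].
     * Returns true if, for any non-empty cluster, the mean distance of its
     * members to their centroid is at least the distance to the nearest
     * other centroid.
     */
    static boolean looksDegenerate(double[][][] clusters, double[][] centroids) {
        for (int i = 0; i < clusters.length; i++) {
            if (clusters[i].length == 0) {
                continue; // nothing to measure for an empty cluster
            }
            double total = 0.0;
            for (double[] member : clusters[i]) {
                total += distance(member, centroids[i]);
            }
            double meanSpread = total / clusters[i].length;

            double nearestOther = Double.POSITIVE_INFINITY;
            for (int j = 0; j < centroids.length; j++) {
                if (j != i) {
                    nearestOther = Math.min(nearestOther, distance(centroids[i], centroids[j]));
                }
            }
            if (meanSpread >= nearestOther) {
                return true;
            }
        }
        return false;
    }
}
```

If this returns true, rerunning k-means with different initial centers or a different k is worth trying before picking representatives.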
Upvotes: 1