Ivor Zhou

Reputation: 1273

How to get the most representative image of a webpage?

There are cases where you want to get the most representative image of a web page; for example, Pocket tries to attach an image when you save a page.

How would you define, in a programmatic way, which image is the key image? What would be the most appropriate way to do so?

Upvotes: 1

Views: 647

Answers (2)

Hassan Baig

Reputation: 15834

Study scraper.py to see how reddit uses BeautifulSoup to find representative images from links submitted to it.
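For a rough idea of what that kind of scraper does, here is a minimal sketch (not reddit's actual code) that collects `<img>` candidates from a page with requests and BeautifulSoup:

    import requests
    from bs4 import BeautifulSoup

    def image_candidates(url):
        """Return the src of every <img> tag on the page as a candidate."""
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        return [img["src"] for img in soup.find_all("img", src=True)]

    print(image_candidates("https://example.com"))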

Upvotes: 2

V13Axel

Reputation: 838

Most websites that want to be shared on sites like Facebook or Pocket provide an Open Graph image. This is a meta tag in the document head of the form <meta property="og:image" content="http://URL-TO-YOUR-IMAGE" />. The Open Graph protocol is read by services such as Facebook, Pocket, and Reddit, and has become fairly widespread.
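Extracting that tag takes only a few lines; here is a sketch using requests and BeautifulSoup, which simply returns None when the page does not declare an og:image:

    import requests
    from bs4 import BeautifulSoup

    def og_image(url):
        """Return the og:image URL declared in the page head, or None."""
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        tag = soup.find("meta", property="og:image")
        return tag["content"] if tag and tag.get("content") else None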

For websites that do not follow that standard, developers often rely on a third-party service such as Embedly, which has already solved this problem: feed it a URL and it returns information about which content would make a good thumbnail.
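As an illustration, a call to Embedly's oEmbed endpoint might look like the sketch below; the endpoint, the key/url parameters, and the thumbnail_url response field are assumptions based on Embedly's public docs, so check the current documentation before relying on them:

    import requests

    def embedly_thumbnail(page_url, api_key):
        """Ask Embedly's oEmbed endpoint for a thumbnail URL (assumed API shape)."""
        resp = requests.get(
            "https://api.embed.ly/1/oembed",            # assumed endpoint
            params={"key": api_key, "url": page_url},   # assumed parameters
            timeout=10,
        )
        resp.raise_for_status()
        return resp.json().get("thumbnail_url")         # assumed response field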

If you want to build your own engine, look into analyzing image size and position within the DOM, and develop your own heuristics by scraping many articles and web pages to find patterns that hold up.
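As a starting point, a toy heuristic (not a proven algorithm) could score each <img> by its declared width x height and prefer images that appear earlier in the document; a real engine would also download the images, check aspect ratios, and filter out sprites and ads:

    import requests
    from bs4 import BeautifulSoup

    def best_image(url):
        """Pick the <img> with the largest declared area, breaking ties by position."""
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        best, best_score = None, 0
        for position, img in enumerate(soup.find_all("img", src=True)):
            try:
                area = int(img.get("width", 0)) * int(img.get("height", 0))
            except ValueError:
                area = 0  # width/height like "100%" can't be scored this way
            score = area - position  # slight preference for earlier images
            if score > best_score:
                best, best_score = img["src"], score
        return best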

Upvotes: 4
