Reputation: 1273
There are cases where you want to pick the most representative image of a web page, e.g. Pocket tries to attach an image when you save a page.
How would you determine programmatically which image is the key image? What would be the most appropriate way to do so?
Upvotes: 1
Views: 647
Reputation: 15834
Study scraper.py to see how Reddit uses BeautifulSoup to find representative images from links submitted to it.
Upvotes: 2
Reputation: 838
Most websites that want to be shared on sites like Facebook or Pocket declare an Open Graph protocol image. This is usually a meta tag inside the page's head that uses the format <meta property="og:image" content="http://URL-TO-YOUR-IMAGE" />. The Open Graph protocol is read by companies such as Facebook, Pocket, and Reddit, and has become fairly widespread.
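As a rough illustration of reading that tag, here is a minimal sketch using only Python's standard-library html.parser. The names OpenGraphImageParser and find_og_image are made up for this example, and it assumes you have already downloaded the page's HTML as a string; a real crawler would also need to handle relative URLs and malformed markup.

```python
from html.parser import HTMLParser
from typing import Optional


class OpenGraphImageParser(HTMLParser):
    """Remembers the content of the first <meta property="og:image"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.image_url: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only look at <meta> tags, and stop once an image has been found.
        if tag != "meta" or self.image_url is not None:
            return
        attr_map = dict(attrs)
        if attr_map.get("property") == "og:image" and attr_map.get("content"):
            self.image_url = attr_map["content"]


def find_og_image(html: str) -> Optional[str]:
    """Return the og:image URL declared in the HTML, or None if absent."""
    parser = OpenGraphImageParser()
    parser.feed(html)
    return parser.image_url


if __name__ == "__main__":
    page = (
        "<html><head>"
        '<meta property="og:image" content="http://URL-TO-YOUR-IMAGE" />'
        "</head><body></body></html>"
    )
    print(find_og_image(page))
```

If find_og_image returns None, the page does not declare an Open Graph image and you would fall back to one of the other approaches.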
For websites that do not follow such a standard, developers often use a third-party tool such as Embedly, which has already solved the problem. Simply feed it a URL and it will return information about which content would make a good thumbnail.
If you want to create your own engine, you may want to look into analyzing where images sit in the DOM, and develop your own algorithm by scraping many articles and web pages to find good patterns.
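As a starting point for such an experiment, here is one very simple heuristic, sketched under loud assumptions: it picks the img tag with the largest declared width × height and ignores anything below a minimum area. This is not how Reddit, Pocket, or Embedly do it; the names ImageCollector and guess_key_image and the min_area threshold are invented for illustration, and images without numeric width and height attributes are simply treated as having zero area.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class ImageCollector(HTMLParser):
    """Collects (declared_area, src) pairs for every <img> with a src."""

    def __init__(self) -> None:
        super().__init__()
        self.candidates: List[Tuple[int, str]] = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        src = attr_map.get("src")
        if not src:
            return
        try:
            width = int(attr_map.get("width") or 0)
            height = int(attr_map.get("height") or 0)
        except ValueError:
            # Non-numeric sizes such as "100%" are treated as unknown.
            width = height = 0
        self.candidates.append((width * height, src))


def guess_key_image(html: str, min_area: int = 10000) -> Optional[str]:
    """Return the src of the largest declared image, or None.

    min_area is an arbitrary cutoff (an assumption) meant to skip
    icons and tracking pixels.
    """
    collector = ImageCollector()
    collector.feed(html)
    if not collector.candidates:
        return None
    area, src = max(collector.candidates, key=lambda pair: pair[0])
    return src if area >= min_area else None
```

In practice you would probably combine several signals (position in the document, aspect ratio, actual downloaded dimensions) and tune them against pages you have scraped, which is the pattern-finding work described above.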
Upvotes: 4