mithun

Reputation: 145

Extracting *relevant* image from a web-page

I run a couple of Twitter-powered news aggregation websites. I have been planning to add images from the articles that I find on Twitter.

If I download the page and extract images using the <img> tag, I get a bunch of images, not all of them relevant to the article. For example, images of buttons, icons, ads, etc. are captured. How do I extract the image accompanying the article? I know there is a solution: Facebook's link sharer does this pretty well.

Mithun

Duplicate of: How to find and extract "main" image in website

Upvotes: 10

Views: 4073

Answers (4)

Pushpender Sharma

Reputation: 294

It's been a long time, but this may help someone next time.

You can use this API: https://urlmeta.org/

It's very simple to use, and the result includes the image we need.

Example of using the API:

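<?php
// Article whose main image we want.
$url = "http://timesofindia.indiatimes.com/business/india-business/Raghuram-Rajan-not-fit-to-be-RBI-Governor-Subramanian-Swamy/articleshow/52236298.cms";

// URL-encode the target so its own special characters don't break the request.
$result = file_get_contents('https://api.urlmeta.org/?url=' . urlencode($url));
if ($result === false) {
    die("Request to urlmeta.org failed\n");
}

// Decode the JSON response into an associative array.
$array = json_decode($result, true);
print_r($array['meta']['image']);

?>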

And that's the result you needed.

Upvotes: 5

Toad

Reputation: 15925

Download all images from the page and blacklist any that come from an ad server. Then find some heuristic that will get you the correct image.

I think something like:

  • Biggest resolution += 5 pts
  • Biggest filesize += 10 pts
  • JPEG += 2 pts

Then take the image with the most points and throw the rest away.

This probably works for the majority of sites.

(Would require some fiddling with the heuristics though)
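The scoring idea above can be sketched roughly as follows. This is only an illustration in Python: the `Candidate` record, the `AD_HOSTS` blacklist, and the function names are all made up for the example, and it assumes you have already downloaded each image and know its dimensions and file size. The point values are the ones suggested above and would need tuning.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical blacklist of ad-server hosts; a real list would be much longer.
AD_HOSTS = {"ads.example.com", "doubleclick.net"}


@dataclass
class Candidate:
    """One downloaded image (hypothetical record for this sketch)."""
    url: str
    width: int
    height: int
    size_bytes: int


def is_ad(url):
    """True if the image host is a blacklisted host or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == h or host.endswith("." + h) for h in AD_HOSTS)


def pick_main_image(candidates):
    """Return the highest-scoring non-ad candidate, or None if none remain."""
    pool = [c for c in candidates if not is_ad(c.url)]
    if not pool:
        return None
    biggest_area = max(c.width * c.height for c in pool)
    biggest_size = max(c.size_bytes for c in pool)

    def score(c):
        points = 0
        if c.width * c.height == biggest_area:
            points += 5   # biggest resolution
        if c.size_bytes == biggest_size:
            points += 10  # biggest file size
        if urlparse(c.url).path.lower().endswith((".jpg", ".jpeg")):
            points += 2   # JPEG, which is typical for photos
        return points

    return max(pool, key=score)


images = [
    Candidate("http://news.example.com/img/icon.png", 16, 16, 900),
    Candidate("http://ads.example.com/banner.jpg", 1200, 600, 300_000),
    Candidate("http://news.example.com/img/story.jpg", 640, 480, 85_000),
]
print(pick_main_image(images).url)
```

In this made-up sample the ad banner is dropped by the blacklist before scoring, so the article photo wins over the small icon.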

Upvotes: 8

mithun

Reputation: 145

I came up with a solution that is a bit hacky but works for me. Here is what I do to get thumbnails.

  1. Say the headline of the page I find is "this is a headline"
  2. I use this as a query to the Google Image API and then extract the first thumbnail I find.

It actually works quite well in the majority of cases. Check it out for yourself: http://cricketfresh.in

Mithun

PS: I think this is a good answer, but I will give credit to anyone who comes up with a more elegant one.

Upvotes: 3

Serkan

Reputation: 349

I would guess that Facebook has a link extractor for the various sites it supports. Something like id="content" -> img (1st).

I guess I was wrong. It seems that Facebook uses the Open Graph protocol to decide which image (og:image) and which metadata to use.
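For pages that publish Open Graph tags, reading og:image could look roughly like this. It is a minimal sketch using only Python's standard-library `html.parser`; the class name, function name, and sample HTML are invented for the example, and a real page may lack the tag entirely, in which case you would need a fallback.

```python
from html.parser import HTMLParser


class OgImageParser(HTMLParser):
    """Remembers the content of the first <meta property="og:image"> tag."""

    def __init__(self):
        super().__init__()
        self.og_image = None

    def handle_starttag(self, tag, attrs):
        # Only <meta> tags matter, and only until the first match is found.
        if tag != "meta" or self.og_image is not None:
            return
        attr = dict(attrs)
        if attr.get("property") == "og:image" and attr.get("content"):
            self.og_image = attr["content"]


def find_og_image(html):
    """Return the og:image URL declared in the HTML, or None if absent."""
    parser = OgImageParser()
    parser.feed(html)
    return parser.og_image


# Invented sample page: one og:image tag plus an unrelated icon in the body.
sample = """<html><head>
<meta property="og:title" content="Example headline">
<meta property="og:image" content="http://news.example.com/img/story.jpg">
</head><body><img src="/icons/share.png"></body></html>"""
print(find_og_image(sample))
```

The icon in the body is ignored because only the `<meta>` tag is inspected, which is what makes this approach more reliable than collecting every `<img>`.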

Upvotes: 1
