Reputation: 145
I have a couple of twitter-powered news aggregation website. I have been planning to add images from articles that I find on twitter.
If I download the page and extract image using <img>
tag, I get a bunch of images; not all of them relevant to the article. For example, images of button, icons, ads etc are captured. How do I extract the image accompanying the article? I know there is a solution -- Facebook link sharer does this pretty well.
Mithun
Duplicate of : How to find and extract "main" image in website
Upvotes: 10
Views: 4073
Reputation: 294
It's been a long time. But this may help next time.
You can use this API https://urlmeta.org/
It's very simple to use and result is the best we need.
example for using API:
<?php
$url = "http://timesofindia.indiatimes.com/business/india-business/Raghuram-Rajan-not-fit-to-be-RBI-Governor-Subramanian-Swamy/articleshow/52236298.cms";
$result = file_get_contents('https://api.urlmeta.org/?url='.$url);
$array = json_decode($result,1);
print_r($array['meta']['image']);
?>
And that's the result you needed.
Upvotes: 5
Reputation: 15925
Download all images from the page, blacklist all images coming from an ad server. then find some heuristic which will get you the correct image...
I think something like:
then take the image with the most points and throw the rest away
Probably works for majority of sites.
(Would require some fiddling with the heuristics though)
Upvotes: 8
Reputation: 145
I kind of came-up with a solution that is a bit hacky but works for me. Here is what I do to get thumbnails.
It actually works quite well for a majority of the cases. Check it out for yourself http://cricketfresh.in
Mithun
ps: I think this is a good answer. Will give credit to someone who comes with a more elegant answer.
Upvotes: 3
Reputation: 349
I would guess that Facebook has a link extractor for the various sites it supports. Something like id="content" -> img (1st).
Guess I am wrong. Seems that Facebook uses the Open Graph Protocol to define which image (og:image) and which metadata to use.
Upvotes: 1