Reputation: 3311
When a user pastes a link into a Facebook status update, Facebook fires off a request to fetch the details of that page.
What I'm wondering is if anyone has any similar functions to tear apart a page?
Having thought about it, getting the title is just a matter of matching a regular expression.
It then usually gets an array of images, which is also fairly easy to do with a regular expression, perhaps filtering out images that are too small.
I'm a little baffled about how it figures out which bit of text is relevant. Any ideas?
Upvotes: 2
Views: 244
Reputation: 38135
It's worth mentioning that since the introduction of Open Graph support, Facebook saves a lot of time and server load when parsing (scraping) pages that use the protocol.
Check out the PHP implementation for more info, and here's a small example using one of the libraries (OpenGraphNode in PHP):
include "OpenGraphNode.php";
# Fetch and parse a URL
#
$page = "http://www.rottentomatoes.com/m/oceans_eleven/";
$node = new OpenGraphNode($page);
# Retrieve the title
#
print $node->title . "\n"; # like this
print $node->title() . "\n"; # or with parentheses
# And obviously the above works for other Open Graph Protocol
# properties like "image", "description", etc. For properties
# that contain a hyphen, you'll need to use underscore instead:
#
print $node->street_address . "\n";
# OpenGraphNode uses PHP5's Iterator feature, so you can
# loop through it like an array.
#
foreach ($node as $key => $value) {
    print "$key => $value\n";
}
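If you'd rather not depend on a library, the same Open Graph properties can be read directly with DOMDocument. This is only a rough sketch: `readOpenGraph()` is a name I made up, and the sample HTML is invented for illustration rather than taken from a real page.

```php
<?php
// Hypothetical helper (not part of any library): collect og:* properties
// from raw HTML into an associative array keyed by the property name.
function readOpenGraph(string $html): array
{
    // Real pages are rarely valid markup, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $properties = [];
    // Open Graph tags look like <meta property="og:title" content="...">.
    foreach ($xpath->query('//meta[starts-with(@property, "og:")]') as $meta) {
        $key = substr($meta->getAttribute('property'), 3); // drop the "og:" prefix
        $properties[$key] = $meta->getAttribute('content');
    }
    return $properties;
}

// Made-up sample markup, standing in for a fetched page.
$sampleHtml = '<html><head>'
    . '<meta property="og:title" content="Example Movie">'
    . '<meta property="og:image" content="http://example.com/poster.jpg">'
    . '<meta property="og:description" content="A short summary.">'
    . '</head><body></body></html>';

$og = readOpenGraph($sampleHtml);
foreach ($og as $key => $value) {
    print "$key => $value\n";
}
```

For a live page you would fetch the source first (for example with `file_get_contents($url)`) and pass it in.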
Upvotes: 0
Reputation: 19251
Regular expressions are a poor fit for parsing HTML because of its nested structure. You will want to use the DOMDocument class instead.
http://www.php.net/manual/en/class.domdocument.php
This parses the page source into a DOM tree. You should be able to pull out the relevant details with XPath queries fairly easily.
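Here is a minimal sketch of that approach. `extractPreview()` is a hypothetical name, the sample markup is invented, and using the first non-empty paragraph as the "relevant text" is only a guess at a reasonable fallback, not a description of what Facebook actually does.

```php
<?php
// Hypothetical helper: pull a title, a text snippet and image URLs from HTML.
function extractPreview(string $html): array
{
    // Suppress warnings about sloppy real-world markup.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Text of the <title> element, or an empty string if there is none.
    $title = trim($xpath->evaluate('string(//title)'));

    // Prefer the meta description; fall back to the first non-empty <p>.
    $text = trim($xpath->evaluate('string(//meta[@name="description"]/@content)'));
    if ($text === '') {
        $text = trim($xpath->evaluate('string(//p[normalize-space()][1])'));
    }

    // Collect the src attribute of every <img> element.
    $images = [];
    foreach ($xpath->query('//img/@src') as $src) {
        $images[] = $src->nodeValue;
    }

    return ['title' => $title, 'text' => $text, 'images' => $images];
}

// Made-up sample markup with no meta description, to exercise the fallback.
$pageHtml = '<html><head><title>Example page</title></head>'
    . '<body><img src="/a.png"><p>First paragraph of the article.</p>'
    . '<img src="/b.png"></body></html>';

$preview = extractPreview($pageHtml);
print $preview['title'] . "\n";
print $preview['text'] . "\n";
print implode(", ", $preview['images']) . "\n";
```

Filtering out images that are too small would be a separate step, for instance by checking dimensions with `getimagesize()` once the image URLs have been resolved.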
You may also want to take a look at the PHP function get_meta_tags().
http://www.php.net/manual/en/function.get-meta-tags.php
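A quick sketch of get_meta_tags() in use. It takes a filename or URL rather than a string, so this example writes some made-up markup to a temporary file first. As far as I know it only picks up `<meta name="..." content="...">` tags, so Open Graph tags (which use `property` instead of `name`) would need the DOMDocument approach.

```php
<?php
// Write invented sample markup to a temp file, since get_meta_tags()
// expects a filename or URL rather than an HTML string.
$tmpFile = tempnam(sys_get_temp_dir(), 'meta');
file_put_contents($tmpFile, '<html><head>'
    . '<meta name="description" content="A short summary.">'
    . '<meta name="keywords" content="php, scraping">'
    . '</head><body></body></html>');

// Returns an array keyed by each tag's name attribute.
$tags = get_meta_tags($tmpFile);
unlink($tmpFile);

print $tags['description'] . "\n";
print $tags['keywords'] . "\n";
```

With a real page you would pass the URL directly, e.g. `get_meta_tags($url)`, provided URL fopen wrappers are enabled.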
Upvotes: 0