azz0r

Reputation: 3311

How does Facebook Link tear down a page?

When a user pastes a link into a Facebook status, it fires off a call to fetch the details of that page.

What I'm wondering is whether anyone has similar functions for tearing apart a page.

Having thought about it, getting the title is just a matter of matching a regular expression.

It then usually gets an array of images, which is also fairly easy to do with a regular expression, maybe filtering out images that are too small.

I'm a little baffled as to how it figures out which bit of text is relevant. Any ideas?

Upvotes: 2

Views: 244

Answers (3)

ifaour

Reputation: 38135

It's worth mentioning that since the introduction of Open Graph support, Facebook saves a lot of time and server load when parsing (scraping) pages that use the protocol, because the title, image, and description are declared in the page's meta tags.

Check out the PHP implementation for more info, and here's a small example using one of the libraries (OpenGraphNode in PHP):

include "OpenGraphNode.php";

# Fetch and parse a URL
#
$page = "http://www.rottentomatoes.com/m/oceans_eleven/";
$node = new OpenGraphNode($page);

# Retrieve the title
#
print $node->title . "\n";    # like this
print $node->title() . "\n";  # or with parentheses

# And obviously the above works for other Open Graph Protocol
# properties like "image", "description", etc. For properties
# that contain a hyphen, you'll need to use underscore instead:
#
print $node->street_address . "\n";

# OpenGraphNode uses PHP5's Iterator feature, so you can
# loop through it like an array.
#
foreach ($node as $key => $value) {
    print "$key => $value\n";
}

Upvotes: 0

dqhendricks

Reputation: 19251

Regular expressions are a poor fit for parsing HTML because of its nested structure. You will want to use the DOMDocument class.

http://www.php.net/manual/en/class.domdocument.php

This will parse the page source into a DOM tree. You should be able to get the relevant details fairly easily using XPath queries.

You may also want to take a look at the PHP function get_meta_tags().
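
A minimal sketch of the DOMDocument plus XPath approach, in case it helps. The sample markup, the variable names, and the 100px width cutoff are my own assumptions for illustration, not anything Facebook is known to use. It pulls the title, any Open Graph meta tags, and the images that declare a reasonable width:

```php
<?php
// Hypothetical sample markup standing in for a fetched page.
$html = <<<'HTML'
<html>
<head>
  <title>Ocean's Eleven</title>
  <meta property="og:title" content="Ocean's Eleven (2001)">
  <meta property="og:image" content="http://example.com/poster.jpg">
</head>
<body>
  <img src="/spacer.gif" width="1" height="1">
  <img src="/still.jpg" width="640" height="360">
</body>
</html>
HTML;

// Real-world HTML is rarely valid, so collect libxml warnings
// internally instead of letting them be emitted.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Text content of the first <title> element.
$title = trim($xpath->evaluate('string(//title)'));

// Open Graph tags: <meta property="og:..." content="...">.
$og = [];
foreach ($xpath->query('//meta[starts-with(@property, "og:")]') as $meta) {
    $og[$meta->getAttribute('property')] = $meta->getAttribute('content');
}

// Keep images whose declared width is at least 100px (assumed cutoff).
// Images without a width attribute are skipped by this simple check.
$images = [];
foreach ($xpath->query('//img[@src]') as $img) {
    if ((int) $img->getAttribute('width') >= 100) {
        $images[] = $img->getAttribute('src');
    }
}

print $title . "\n";
print_r($og);
print_r($images);
```

Many pages omit the width attribute, so a real scraper would probably have to download each candidate image and measure it (for example with getimagesize()) before filtering.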

http://www.php.net/manual/en/function.get-meta-tags.php
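
A rough sketch of get_meta_tags() in use. The function reads from a file name or URL, so to keep the example offline it writes some made-up sample markup to a temporary file first; the markup and variable names are assumptions for illustration only:

```php
<?php
// Hypothetical sample markup standing in for a fetched page.
$html = <<<'HTML'
<html>
<head>
  <title>Sample page</title>
  <meta name="description" content="A short summary of the page.">
  <meta name="keywords" content="php, scraping">
</head>
<body><p>Body text.</p></body>
</html>
HTML;

// get_meta_tags() expects a file name or URL, so write the markup to disk.
$path = tempnam(sys_get_temp_dir(), 'meta');
file_put_contents($path, $html);

// Returns an array keyed by the lowercased name attribute of each meta tag.
$tags = get_meta_tags($path);
unlink($path);

// The meta description is a reasonable first guess at the "relevant" text.
$description = $tags['description'] ?? null;

print_r($tags);
```

As far as I know, get_meta_tags() only picks up tags that use the name attribute, so Open Graph tags written with property="og:..." would still need DOMDocument or a dedicated library.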

Upvotes: 0

Sushant

Reputation: 1003

Perhaps looking at an article extractor like Goose might help?

Upvotes: 1
