I'm trying to grab the latest news from a website and include it on my own. This site uses Joomla (ugh) and the resulting content hrefs are missing the base href.
So the links contain only contensite.php?blablabla, which on my site resolves to http://www.example.com/contensite.php?blablabla.
So I thought of replacing http:// with http://www.basehref.com before echoing it out, but my knowledge stops here.
Which should I use: preg_replace or str_replace? I'm not sure.
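To make the difference concrete, here is a rough sketch of both functions on a made-up snippet. The sample markup and the example.org link are assumptions for illustration; the base URL is the placeholder from the question. It also treats every relative link as relative to the site root, which may not hold for every page.

```php
<?php
// Hypothetical sample markup; the base URL is the placeholder from the question.
$base = 'http://www.basehref.com/';
$html = '<a href="contensite.php?blablabla">News</a> '
      . '<a href="http://www.example.org/page">Elsewhere</a>';

// str_replace: prefixes every href, including the one that is already absolute.
$naive = str_replace('href="', 'href="' . $base, $html);

// preg_replace: the negative lookahead skips values that start with a scheme
// (http:, mailto:, ...), with // or with #, so only relative links are prefixed.
// An optional leading slash is swallowed because $base already ends in one.
$fixed = preg_replace(
    '~href="(?![a-z][a-z0-9+.-]*:|//|#)/?~i',
    'href="' . $base,
    $html
);

echo $naive, PHP_EOL, $fixed, PHP_EOL;
```

str_replace is fine when you are sure every link is relative; as soon as absolute links can appear, the pattern-based version is the safer of the two.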
Upvotes: 0
Views: 673
include_once('db_connect.php');
// connect to my db
require_once('Net/URL2.php');
// PEAR Net_URL2, used to resolve relative URLs
include_once('dom.php');
// include PHP Simple HTML DOM
$url = 'http://www.targetsite.com';
// address of the page to fetch; relative links are resolved against it
$dom = file_get_html($url);
// fetch the HTML of the site and parse it with Simple HTML DOM
$elem2 = $dom->find('div[class=blog]', 0);
// the div to target
$uri = new Net_URL2($url); // URI of the resource
$baseURI = $uri;
foreach ($dom->find('base[href]') as $elem) {
    // a <base> tag lives in the <head>, so search the whole document, not just the div
    $baseURI = $uri->resolve($elem->href);
}
foreach ($elem2->find('*[src]') as $elem) {
    $elem->src = $baseURI->resolve($elem->src)->__toString();
}
foreach ($elem2->find('*[href]') as $elem) {
    if (strtoupper($elem->tag) === 'BASE') continue;
    $elem->href = $baseURI->resolve($elem->href)->__toString();
}
echo $elem2;
This fixes all broken links by resolving every src and href against the base URL. It requires the PEAR package Net_URL2 (Net/URL2.php).
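If installing PEAR is not an option, the resolving step can be approximated in plain PHP. The function below is only a simplified sketch and not a replacement for Net_URL2: the name resolve_url is hypothetical, and it does not normalize "./" or "../" segments.

```php
<?php
// Simplified resolver (hypothetical helper, not part of any library).
// Handles absolute, protocol-relative, root-relative, query-only,
// fragment-only and document-relative references.
function resolve_url(string $base, string $rel): string
{
    if ($rel === '') {
        return $base;
    }
    // Already absolute: starts with a scheme such as http: or mailto:
    if (preg_match('~^[a-z][a-z0-9+.-]*:~i', $rel)) {
        return $rel;
    }
    $parts  = parse_url($base);
    $scheme = $parts['scheme'] ?? 'http';
    $origin = $scheme . '://' . ($parts['host'] ?? '')
            . (isset($parts['port']) ? ':' . $parts['port'] : '');
    $path   = $parts['path'] ?? '/';

    if (strpos($rel, '//') === 0) {   // protocol-relative: reuse the base scheme
        return $scheme . ':' . $rel;
    }
    if ($rel[0] === '/') {            // root-relative: keep only the origin
        return $origin . $rel;
    }
    if ($rel[0] === '?') {            // query-only: keep the base path
        return $origin . $path . $rel;
    }
    if ($rel[0] === '#') {            // fragment-only: keep base path and query
        $query = isset($parts['query']) ? '?' . $parts['query'] : '';
        return $origin . $path . $query . $rel;
    }
    // Document-relative: replace everything after the last slash of the base path.
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $rel;
}

echo resolve_url('http://www.targetsite.com/news/index.php', 'contensite.php?id=1'), PHP_EOL;
echo resolve_url('http://www.targetsite.com/news/index.php', '/index.php?blabla'), PHP_EOL;
```

In the loops above, a call like resolve_url($url, $elem->href) would take the place of $baseURI->resolve(...), under the assumption that the page has no base tag.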
Upvotes: 0
I'm not able to fix the broken links (I lack the knowledge of regex matching), so instead I replace them with another link and swap the class of each link for my fancybox class. That way a click opens the source site in a fancybox.
include_once('db_connect.php');
// connect to my db
include_once('dom.php');
// include PHP Simple HTML DOM
$dom = file_get_html('http://www.remotesite.com');
// fetch the HTML of the site and parse it with Simple HTML DOM
$elem = $dom->find('div[class=blog]', 0);
// the div to target
$pattern = '/(?<=href=")[^"]+(?=")/';
// match the value of every href="..." attribute
$replacement = 'http://www.remotesite.com';
$replacedHrefHtml = preg_replace($pattern, $replacement, (string) $elem);
// replacement 1
// overwrite the broken links with the address of the remote site (the base href is missing from the Joomla output)
// I'm too lazy to match it any other way, feel free to improve this!
$pattern2 = '/contentpagetitle/';
$replacement2 = 'fancybox fancybox.iframe';
$replacedHrefHtml2 = preg_replace($pattern2, $replacement2, $replacedHrefHtml);
// replacement 2
// replace the Joomla class contentpagetitle on the links with my fancybox class
$pattern3 = '/readon/';
$replacedHrefHtml2 = preg_replace($pattern3, $replacement2, $replacedHrefHtml2);
// replacement 3
// replace the Joomla class readon on the links with my fancybox class
// (this runs on the result of replacement 2, so both replacements are kept)
$replacedHrefHtml3 = preg_replace('/<img[^>]+>/i', '<br />(Plaatje)<br /><br /> ', $replacedHrefHtml2);
// finally replace the images in the string with a "(Plaatje)" placeholder (Dutch for "picture")
$replacedHrefHtml4 = base64_encode($replacedHrefHtml3);
// encode the HTML with base64 before storing it in MySQL
// I found that real escaping broke the links
$lastitem = null;
try {
    $conn = new PDO($link, $pdo_username, $pdo_password);
    $conn->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    $data222 = $conn->query('SELECT * FROM svvnieuws ORDER BY id DESC LIMIT 1');
    foreach ($data222 as $row) {
        $lastitem = $row['inhoud'];
    }
} catch (PDOException $e) {
    echo 'ERROR: ' . $e->getMessage();
}
// get the last stored item from the db to compare it with the current result
if ($replacedHrefHtml4 !== $lastitem) {
    // only store a new item if it differs from the last one in the db; important to prevent clutter
    $conn = new PDO($link, $pdo_username, $pdo_password);
    $conn->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    // set up the connection to the db
    $sql = 'INSERT INTO svvnieuws (inhoud) VALUES (:inhoud)';
    // query with a named placeholder; id is left out on the assumption that it is an auto-increment column
    $rip = $conn->prepare($sql);
    $rip->execute(array(':inhoud' => $replacedHrefHtml4));
    // insert into the db
}
// place this file outside of the docroot, and let cron run it every 4 hours or so.
// of course, make sure you also place dom.php in the same directory!
// dom.php is my short name for PHP Simple HTML DOM.
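A side note on the base64 step: a prepared statement with a bound parameter stores the HTML unchanged, quotes included, so the encoding should not be needed to protect the links. The sketch below tries this against an in-memory SQLite database so it can run on its own. It assumes the pdo_sqlite extension is available; the helper name store_if_changed and the sample HTML are made up, and the real script would keep its MySQL connection.

```php
<?php
// Assumption: pdo_sqlite is installed. The table mirrors the one used in the script.
$conn = new PDO('sqlite::memory:');
$conn->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$conn->exec('CREATE TABLE svvnieuws (id INTEGER PRIMARY KEY AUTOINCREMENT, inhoud TEXT)');

// Hypothetical helper: insert $html only if it differs from the newest stored row.
function store_if_changed(PDO $conn, string $html): bool
{
    $last = $conn->query('SELECT inhoud FROM svvnieuws ORDER BY id DESC LIMIT 1')
                 ->fetchColumn(); // false when the table is still empty
    if ($last === $html) {
        return false; // unchanged, nothing to store
    }
    $stmt = $conn->prepare('INSERT INTO svvnieuws (inhoud) VALUES (:inhoud)');
    $stmt->execute([':inhoud' => $html]); // the value is bound, not concatenated
    return true;
}

// Sample HTML with both kinds of quotes, stored without any encoding.
$html = '<a class="fancybox" href="http://www.remotesite.com/index.php?a=1">It\'s news</a>';

$first  = store_if_changed($conn, $html); // stores the row
$second = store_if_changed($conn, $html); // finds an identical row and skips it
var_dump($first, $second);
```

Reading the row back should return the exact string that went in, which is what makes the comparison with the last stored item work without base64.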
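So replacement 1 changes
<a href="whatever"> to <a href="http://www.remotesite.com">,
replacement 2 changes the contentpagetitle class on those links to the fancybox class,
replacement 3 changes the readon class on the read-more links to the fancybox class,
and the images are swapped for a placeholder.
The result is then compared to the last stored item and stored only if it is different.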
I would love to figure out how to fix the broken links instead of replacing them. In the source, the links on the site look like this: <a href="/index.php?blabla">. How, if at all possible, could I inject www.remotesite.com into <a href="/index.php?blabla">, making it <a href="http://www.remotesite.com/index.php?blabla">?
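One way this might be done with a small change to the lookbehind pattern from replacement 1: instead of matching the whole href value, match only the empty position right after href=" when the value starts with a single slash, and insert the site address there. This is a sketch on made-up sample markup, and it only covers root-relative links like the one above.

```php
<?php
// Hypothetical sample markup with one root-relative and one absolute link.
$html = '<a href="/index.php?blabla">Read more</a> '
      . '<a href="http://www.remotesite.com/about">About</a>';

// Zero-width match: the lookbehind requires href=" just before the position,
// the lookahead requires a single slash right after it (a double slash would be
// a protocol-relative URL). Nothing is consumed, so nothing is overwritten.
$pattern     = '~(?<=href=")(?=/(?!/))~';
$replacement = 'http://www.remotesite.com';

$fixed = preg_replace($pattern, $replacement, $html);
echo $fixed, PHP_EOL;
```

Because the match is empty, preg_replace inserts the address in front of the existing path and leaves links that are already absolute alone.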
Upvotes: 0