user1987641

preg_replace in simple html dom

I'm trying to grab the latest news from a website and include it on my own. This site uses Joomla (ugh) and the resulting content hrefs are missing the base href.

So the links only contain contensite.php?blablabla, which on my site results in links like http://www.example.com/contensite.php?blablabla.

So I thought of rewriting the links to start with http://www.basehref.com before echoing them out, but my knowledge stops here.

Which should I use: preg_replace, str_replace? I'm not sure.

Upvotes: 0

Views: 673

Answers (2)

user1987641

include_once('db_connect.php');
// connect to my db
require_once('Net/URL2.php');
// PEAR Net_URL2, used to resolve relative URLs
include_once('dom.php');
// include PHP Simple HTML DOM

$dom = file_get_html('http://www.targetsite.com');
// fetch the HTML of the site and parse it with Simple HTML DOM

$elem2 = $dom->find('div[class=blog]', 0);
// the div to target

$uri = new Net_URL2('http://www.svvenray.nl'); // URI of the resource (the site fetched above)
$baseURI = $uri;
// a <base href> lives in the document head, so search the whole document, not just the div
foreach ($dom->find('base[href]') as $elem) {
    $baseURI = $uri->resolve($elem->href);
}

// make every src inside the div absolute
foreach ($elem2->find('*[src]') as $elem) {
    $elem->src = $baseURI->resolve($elem->src)->__toString();
}
// make every href inside the div absolute
foreach ($elem2->find('*[href]') as $elem) {
    if (strtoupper($elem->tag) === 'BASE') continue;
    $elem->href = $baseURI->resolve($elem->href)->__toString();
}

echo $elem2;

This resolves every relative src and href inside the div against the base URI, which fixes the broken links. It requires the PEAR package Net_URL2 (Net/URL2.php).

Upvotes: 0

user1987641

So I'm not able to fix the broken links (since I lack the knowledge of preg matching). Instead I'm replacing them with another link and changing the class of each link to my fancybox class, so that it opens the source site in a fancybox.

include_once('db_connect.php');
// connect to my db

include_once('dom.php');
// include html_simple_dom!

$dom = file_get_html('http://www.remotesite.com');
// get the html content of a site and pass it through html simple dom !

$elem = $dom->find('div[class=blog]', 0);
// set the div to target for !



$pattern = '/(?<=href=")[^"]+(?=")/';
$replacement = 'http://www.remotesite.com';
$replacedHrefHtml = preg_replace($pattern, $replacement, (string) $elem);
// replacement 1
// replace every href value with the remote site's URL (base href is missing, Joomla sucks, period!)
// I'm too lazy to preg_match it any other way, feel free to improve this!

$pattern2 = '/contentpagetitle/';
$replacement2 = 'fancybox fancybox.iframe';
$replacedHrefHtml2 = preg_replace($pattern2, $replacement2, $replacedHrefHtml);
// replacement 2
// replace the Joomla class contentpagetitle on the links with my fancybox class! fancy innit!


$pattern3 = '/readon/';
$replacement3 = 'fancybox fancybox.iframe';
$replacedHrefHtml2 = preg_replace($pattern3, $replacement3, $replacedHrefHtml2);
// replacement 3
// replace the Joomla class readon on the links with my fancybox class, keeping the result of replacement 2

$replacedHrefHtml3 = preg_replace("/<img[^>]+\>/i", "<br />(Plaatje)<br /><br /> ", $replacedHrefHtml2);
// finally replace the images in the string with a "(Plaatje)" placeholder


$replacedHrefHtml4 = base64_encode($replacedHrefHtml3);
// encode the HTML with base64 before storing it in MySQL
// real escape won't work since it will break the links!

$lastitem = null;
// stays null when the table is empty or the query fails

try {
    $conn = new PDO($link, $pdo_username, $pdo_password);
    $conn->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

    $data222 = $conn->query('SELECT * FROM svvnieuws ORDER BY id DESC LIMIT 1');

    foreach ($data222 as $row) {
        $lastitem = $row['inhoud'];
    }
} catch (PDOException $e) {
    echo 'ERROR: ' . $e->getMessage();
}
// get the last stored item from the db for comparison with the current result!

if ($replacedHrefHtml4 == $lastitem) {
    // if the last item from the db is the same, do not store a new item! important to prevent clutter!

}
else {
    // if it's not the same, store a new item!

    $conn = new PDO($link, $pdo_username, $pdo_password);
    $conn->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    // set up the connection to the db

    $sql = 'INSERT INTO svvnieuws (inhoud) VALUES (:inhoud)';
    // the query string with a named placeholder; id is left out, assuming it is an AUTO_INCREMENT column

    $rip = $conn->prepare($sql);
    $rip->execute(array(':inhoud' => $replacedHrefHtml4));
    // insert into the db!

}
// close the else !

// place this file outside of the docroot, and let cron run it every 4 hours or so.
// of course make sure you also place dom.php in the same directory!
// dom.php is my short name for PHP Simple HTML DOM.

So replacement 1 changes every <a href="whatever"> to <a href="http://www.remotesite.com">. Replacement 2 changes the contentpagetitle class on those links to the fancybox class, and replacement 3 does the same for the readon class. The result is then compared to the last stored item and stored only if it differs.

I would love to figure out how to fix the broken links instead of replacing them. The links appear in the site's source like this: <a href="/index.php?blabla">. How, if at all possible, would I be able to inject http://www.remotesite.com into <a href="/index.php?blabla">, making it <a href="http://www.remotesite.com/index.php?blabla">?
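
One possible way to do that injection with preg_replace_callback, as a minimal sketch and not a tested solution for every page. The helper name absolutize_links is hypothetical (it is not part of PHP Simple HTML DOM). It assumes the attributes are double-quoted and that every relative link is relative to the site root, so it ignores any directory in the page's own path.

```php
<?php
// Hypothetical helper: prefix relative href/src values in an HTML string
// with a base URL, leaving links that are already absolute alone.
function absolutize_links(string $html, string $base): string
{
    $base = rtrim($base, '/');

    return preg_replace_callback(
        '/\b(href|src)="([^"]*)"/i',
        function (array $m) use ($base): string {
            $url = $m[2];

            // Keep empty values, in-page anchors (#...), protocol-relative
            // URLs (//host/...) and anything with a scheme (http:, mailto:, ...).
            if ($url === '' || $url[0] === '#'
                || strpos($url, '//') === 0
                || preg_match('/^[a-z][a-z0-9+.\-]*:/i', $url)) {
                return $m[0];
            }

            // "/index.php?x" and "index.php?x" both become "<base>/index.php?x".
            return $m[1] . '="' . $base . '/' . ltrim($url, '/') . '"';
        },
        $html
    );
}

$html = '<a href="/index.php?blabla">a</a> <a href="http://other.example/x">b</a>';
echo absolutize_links($html, 'http://www.remotesite.com'), "\n";
```

In the script above this could take the place of replacement 1, for example $replacedHrefHtml = absolutize_links((string) $elem, 'http://www.remotesite.com');. A real URL resolver such as Net_URL2 is the more robust choice when pages link relative to subdirectories.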

Upvotes: 0
