Randy

Reputation: 1

How to search a webpage (fetched with cURL) for a link?

I need a function that searches the $get_webpage variable to see whether it contains my site's link code ($linktext). The function should search the whole page for $linktext, which only counts if it appears after the <body> tag and before the </body> tag. Thanks for all your help.


[[UPDATE]] Hi guys, quick update to clarify: link code on the example.com page that contains rel="nofollow" should not count as a match. Example:

<a href="mysite.com/" rel="nofollow"><strong>My Site</strong></a>

    $cc = new cURL();
    $get_webpage=$cc->get('http://www.example.com');
    $linktext='<a href="http://www.mysite.com/"><strong>My Site</strong></a>';



    //####################################################################
    // GET URL FUNCTION
    //####################################################################
    class cURL {
        var $headers;
        var $user_agent;
        var $compression;
        var $cookie_file;
        var $cookies;
        var $proxy;

        // __construct() replaces the PHP 4 style cURL() method, which PHP 8 no longer treats as a constructor
        function __construct($cookies=TRUE,$cookie='cookie.txt',$compression='gzip',$proxy='') {
            $this->headers[] = 'Accept: image/gif, image/x-bitmap, image/jpeg, image/pjpeg';
            $this->headers[] = 'Connection: Keep-Alive';
            $this->headers[] = 'Content-type: application/x-www-form-urlencoded;charset=UTF-8';
            $this->user_agent = 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.0.3705; .NET CLR 1.1.4322; Media Center PC 4.0)';
            $this->compression=$compression;
            $this->proxy=$proxy;
            $this->cookies=$cookies;
            if ($this->cookies == TRUE) $this->cookie($cookie);
        }

        // Make sure the cookie file exists and remember its path
        function cookie($cookie_file) {
            if (!file_exists($cookie_file)) {
                // fclose() needs the handle returned by fopen(), not the file name
                $fp = fopen($cookie_file,'w') or $this->error('The cookie file could not be opened. Make sure this directory has the correct permissions');
                fclose($fp);
            }
            $this->cookie_file=$cookie_file;
        }

        // Fetch $url with GET and return the response body (false on failure)
        function get($url) {
            $process = curl_init($url);
            curl_setopt($process, CURLOPT_HTTPHEADER, $this->headers);
            curl_setopt($process, CURLOPT_HEADER, 0);
            curl_setopt($process, CURLOPT_USERAGENT, $this->user_agent);
            if ($this->cookies == TRUE) curl_setopt($process, CURLOPT_COOKIEFILE, $this->cookie_file);
            if ($this->cookies == TRUE) curl_setopt($process, CURLOPT_COOKIEJAR, $this->cookie_file);
            curl_setopt($process, CURLOPT_ENCODING, $this->compression);
            curl_setopt($process, CURLOPT_TIMEOUT, 30);
            if ($this->proxy) curl_setopt($process, CURLOPT_PROXY, $this->proxy);
            curl_setopt($process, CURLOPT_RETURNTRANSFER, 1);
            curl_setopt($process, CURLOPT_FOLLOWLOCATION, 1);
            curl_setopt($process, CURLOPT_MAXREDIRS, 2);
            $return = curl_exec($process);
            curl_close($process);
            return $return;
        }

        // Send $data to $url with POST and return response headers plus body
        function post($url,$data) {
            $process = curl_init($url);
            curl_setopt($process, CURLOPT_HTTPHEADER, $this->headers);
            curl_setopt($process, CURLOPT_HEADER, 1);
            curl_setopt($process, CURLOPT_USERAGENT, $this->user_agent);
            if ($this->cookies == TRUE) curl_setopt($process, CURLOPT_COOKIEFILE, $this->cookie_file);
            if ($this->cookies == TRUE) curl_setopt($process, CURLOPT_COOKIEJAR, $this->cookie_file);
            curl_setopt($process, CURLOPT_ENCODING, $this->compression);
            curl_setopt($process, CURLOPT_TIMEOUT, 30);
            if ($this->proxy) curl_setopt($process, CURLOPT_PROXY, $this->proxy);
            curl_setopt($process, CURLOPT_POSTFIELDS, $data);
            curl_setopt($process, CURLOPT_RETURNTRANSFER, 1);
            curl_setopt($process, CURLOPT_FOLLOWLOCATION, 1);
            curl_setopt($process, CURLOPT_MAXREDIRS, 2);
            curl_setopt($process, CURLOPT_POST, 1);
            $return = curl_exec($process);
            curl_close($process);
            return $return;
        }

        // Write the message to error.txt and stop the script
        function error($error) {
            $fp = fopen("error.txt","w") or die ();
            $error_text="cURL Error:$error\n";
            fputs($fp,$error_text);
            fclose($fp) or die ();
            die;
        }
    }
    //######################################################################
    // END URL FUNCTION
    //######################################################################

Upvotes: 0

Views: 540

Answers (4)

Dejan Marjanović

Reputation: 19380

I didn't know that anchors could be outside the body tag :)

First extract the inner HTML of the body tag with preg_match. You can then use plain strpos to search it, provided you know exactly what the link looks like in the HTML.
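A minimal sketch of that approach, assuming the page has a single body element and that the link must match $linktext character for character. The function name bodyContainsLink is made up for illustration; it is not part of the question's code.

```php
<?php
// Hypothetical helper (not from the original post): true when $linktext
// appears verbatim between <body ...> and </body> in $html.
function bodyContainsLink($html, $linktext)
{
    // i = case-insensitive tag names, s = let "." match newlines.
    if (!preg_match('~<body\b[^>]*>(.*)</body>~is', $html, $m)) {
        return false; // no body element found
    }
    // strpos() returns 0 for a match at offset 0, so compare with !== false.
    return strpos($m[1], $linktext) !== false;
}
```

Because the comparison is an exact substring match, a copy of the link that carries rel="nofollow" (or any other extra attribute or whitespace) is not counted. That covers the update in the question, but it also means harmless formatting differences cause a miss.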

Upvotes: 0

prodigitalson

Reputation: 60413

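The following does it all with XPath, but it assumes you want to require that "My Site" sits inside a strong tag:

    function findLinks($html, $href, $text)
    {
       // SimpleXML cannot parse real-world HTML itself, so load it through DOMDocument first
       $dom = new DOMDocument();
       @$dom->loadHTML($html);
       $xml = simplexml_import_dom($dom);

       // Match <a href="$href"> elements whose <strong> child contains $text
       $links = $xml->xpath("//a[@href='$href']/strong[contains(., '$text')]");

       // xpath() returns an array of matches; an empty result means no link was found
       return !empty($links);
    }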

If you don't care about the strong tag you could use an XPath like:

//a[@href='$href'][contains(., '$text')]

Do some research on XPath to see what's possible. You could of course just use a simple XPath to get all the a tags and then loop over them looking for your qualifiers, as another poster suggested.
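As one example of what XPath can express, here is a rough sketch that also handles the rel="nofollow" update from the question. It uses DOMXPath rather than SimpleXML, the name hasFollowedLink is invented for illustration, and it assumes that neither $href nor $text contains a single quote (the values are interpolated into the query unescaped).

```php
<?php
// Hypothetical helper: true when the body holds a link to $href whose text
// contains $text and whose rel attribute has no "nofollow" token.
function hasFollowedLink($html, $href, $text)
{
    // curl_exec() returns false on failure; loadHTML() rejects empty input.
    if (!is_string($html) || $html === '') {
        return false;
    }
    $dom = new DOMDocument();
    // Suppress the warnings that non-validating real-world HTML triggers.
    @$dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    // Pad @rel with spaces so "nofollow" is matched as a whole token.
    $query = "//body//a[@href='$href']"
           . "[contains(., '$text')]"
           . "[not(contains(concat(' ', normalize-space(@rel), ' '), ' nofollow '))]";

    return $xpath->query($query)->length > 0;
}
```

XPath 1.0 string comparison is case-sensitive, so rel="NoFollow" would slip past this query; checking the attribute in PHP inside a loop is one way around that.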

Upvotes: 0

Christian

Reputation: 28165

There are four ways to do this (that I know of):

  • XML
  • DOM
  • Manual Parsing
  • Regular Expressions

I suggest the first two, perhaps DOM more than XML. See Byron's example, it ought to do the trick.

Upvotes: 0

Byron Whitlock

Reputation: 53921

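You can use the DOM handling functions:

    $dom = new DOMDocument();
    // $html is the page source, e.g. the question's $get_webpage; @ hides warnings about invalid markup
    @$dom->loadHTML($html);
    $x = new DOMXPath($dom);
    // Walk every <a> element in the document
    foreach($x->query("//a") as $node)
    {
        // Exact string comparison: the value must match the href as written in the page
        if ($node->getAttribute("href") == "http://mysite.com")
        {
            // we got the link via href
        }
        // textContent is the visible link text with all tags stripped
        if ($node->textContent == "http://mysite.com")
        {
            // we got the link via text
        }
    }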
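Building on that loop, here is a sketch that also applies the rel="nofollow" rule from the question's update. The function name linkIsFollowed is hypothetical, and the href is still compared as an exact string, so the value passed in has to match what the page actually contains.

```php
<?php
// Hypothetical helper: true when some <a> has exactly $href and no
// "nofollow" token in its rel attribute (compared case-insensitively).
function linkIsFollowed($html, $href)
{
    // curl_exec() returns false on failure; loadHTML() rejects empty input.
    if (!is_string($html) || $html === '') {
        return false;
    }
    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    $x = new DOMXPath($dom);

    foreach ($x->query('//a') as $node) {
        if ($node->getAttribute('href') !== $href) {
            continue; // not our link
        }
        // rel is a space-separated token list, e.g. rel="external NoFollow".
        $rel = strtolower($node->getAttribute('rel'));
        $tokens = preg_split('/\s+/', $rel, -1, PREG_SPLIT_NO_EMPTY);
        if (!in_array('nofollow', $tokens, true)) {
            return true; // found a copy of the link without nofollow
        }
    }
    return false;
}
```

Doing the rel check in PHP rather than in the XPath expression makes it easy to lower-case the attribute first, which XPath 1.0 has no built-in function for.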

Upvotes: 1
