Justin k

Reputation: 1104

C# regular expression in PHP?

I want my PHP program to extract all the URLs from an HTML file. When I wrote a C# program to do this, I used the following regular expression and then prepended "http" to each match to get the full URL list. Can you please tell me how I can use the same regular expression in PHP?

        List<string> links = new List<string>();
        Regex regEx;
        Match matches;

        regEx = new Regex("href=\"http\\s*(?:\"(?<1>[^\"]*)\"|(?<1>\\S+))\"", RegexOptions.IgnoreCase | RegexOptions.Compiled);
        for (matches = regEx.Match(downloadString); matches.Success; matches = matches.NextMatch())
        {
            links.Add("http" + matches.Groups[1].ToString());
        } //Add all the URLs to a list and return the list

        return links;

I would really appreciate it if you can show it to me with an example:


@julian Thank you so much for the detailed explanation. I modified the code a little and used it in the following way:

$html = file_get_contents('http://mysmallwebpage.com/');
$dom = new DOMDocument;
$dom->loadHTML($html);
$links = $dom->getElementsByTagName('a');

foreach ($links as $link)
{
    $returnLink = $link->getAttribute('href');
    echo "<br />", $returnLink;
}

But the result doesn't show the full URL. It outputs things like:

/nmsd-gallery/
/home/?currentPage=3
javascript:noop();

Can you please tell me if there is a way to get the full URL, including the site address, such as http://mysmallwebpage.com/?

Upvotes: 0

Views: 88

Answers (2)

julian

Reputation: 36

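Hmm, these are internal (relative) links of the page. In this case you have to filter out the JavaScript links (and any other unwanted targets, such as images) and prefix the remaining ones with the base URL of the site you fetched:

...

// $baseUrl is the site you fetched, e.g. 'http://mysmallwebpage.com'
foreach ($links as $link)
{
    $returnLink = $link->getAttribute('href');
    // Skip JavaScript pseudo-links (add other unwanted cases here)
    if (stripos($returnLink, "javascript:") === 0)
    {
        continue;
    }
    // Not an absolute URL: prefix it with the base URL
    if (stripos($returnLink, "http://") !== 0 && stripos($returnLink, "https://") !== 0)
    {
        $returnLink = rtrim($baseUrl, '/') . '/' . ltrim($returnLink, '/');
    }
    echo "<br />++", $returnLink;
}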

There are many more cases to check, but I think this gives you an idea.
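
To make a few of those extra cases concrete, here is a minimal sketch of a helper that converts an href into an absolute URL. The function name `toAbsoluteUrl` and its parameters are my own invention for illustration, and it makes simplifying assumptions: the base URL has a scheme and a host, ports are ignored, and `../` segments are not resolved.

```php
<?php
// Hypothetical helper: returns an absolute URL for $href, or null for
// links that should be skipped (javascript:, mailto:, fragments, empty).
function toAbsoluteUrl($href, $baseUrl)
{
    $href = trim($href);
    if ($href === '' || $href[0] === '#'
        || preg_match('/^(javascript|mailto|tel):/i', $href)) {
        return null;
    }
    // Already absolute: return unchanged
    if (preg_match('#^https?://#i', $href)) {
        return $href;
    }
    $parts = parse_url($baseUrl);
    $root  = $parts['scheme'] . '://' . $parts['host'];
    // Protocol-relative link such as //cdn.example.com/x
    if (substr($href, 0, 2) === '//') {
        return $parts['scheme'] . ':' . $href;
    }
    // Root-relative link such as /home/?currentPage=3
    if ($href[0] === '/') {
        return $root . $href;
    }
    // Path-relative link: append to the directory part of the base path
    $path = isset($parts['path']) ? $parts['path'] : '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $root . $dir . $href;
}
```

Inside the loop you would call something like `$url = toAbsoluteUrl($link->getAttribute('href'), 'http://mysmallwebpage.com/');` and skip the link when the result is null.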

Upvotes: 1

julian

Reputation: 36

Try extracting the URLs with PHP's DOM extension (DOMDocument) instead of a regular expression:

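    // $aktPage holds the URL (or path) of the page to scan
    $html = file_get_contents($aktPage);
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    $links = $dom->getElementsByTagName('a');

    // Collect the href attribute of every <a> element
    $returnLinks = array();
    foreach ($links as $link)
    {
        $returnLinks[] = $link->getAttribute('href');
    }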
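
If you do want to stay with a regular expression, as in your C# code, `preg_match_all` is the PHP counterpart of looping over `Match`/`NextMatch`. The following is only a sketch with a simplified pattern, not a verbatim port of your C# regex: as far as I know, PCRE does not accept a numeric group name like `(?<1>...)`, so it uses a plain capture group. The sample `$html` string is made up for illustration.

```php
<?php
// Assumed sample input; in practice this would come from file_get_contents().
$html = '<a href="http://example.com/a">A</a> '
      . '<a HREF="https://example.com/b?x=1">B</a> '
      . '<a href="/local">C</a>';

// Capture the value of every double-quoted href starting with http:// or https://.
// The "i" modifier plays the role of RegexOptions.IgnoreCase.
preg_match_all('/href="(https?:\/\/[^"]*)"/i', $html, $matches);

// $matches[1] holds the first capture group of each match.
$links = $matches[1];
print_r($links);
```

Because the capture group already includes the scheme, there is no need to prepend "http" afterwards. Relative links such as `/local` are not matched by this pattern.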

Upvotes: 1
