Reputation: 1104
I want my PHP program to extract all the URLs from an HTML file. When I wrote a C# program to do the same thing, I used the following regular expression and then prepended "http" to each match to get the full URL list. Can you please tell me how I can use the regular expression from the following code in PHP?
List<string> links = new List<string>();
Regex regEx;
Match matches;
regEx = new Regex("href=\"http\\s*(?:\"(?<1>[^\"]*)\"|(?<1>\\S+))\"", RegexOptions.IgnoreCase | RegexOptions.Compiled);
for (matches = regEx.Match(downloadString); matches.Success; matches = matches.NextMatch())
{
    links.Add("http" + matches.Groups[1].ToString());
} //Add all the URLs to a list and return the list
return links;
I would really appreciate it if you could show me how with an example.
@julian Thank you so much for the detailed explanation. I modified the code a little and used it in the following way:
$html = file_get_contents('http://mysmallwebpage.com/');
$dom = new DOMDocument;
$dom->loadHTML($html);
$links = $dom->getElementsByTagName('a');
foreach ($links as $link)
{
    $returnLink = $link->getAttribute('href');
    echo "<br />", $returnLink;
}
But the result doesn't show the full URL. It outputs things like:
/nmsd-gallery/
/home/?currentPage=3
javascript:noop();
Can you please tell me if there is a way to get the full URL address, such as:
http://mysmallwebpage.com/
Upvotes: 0
Views: 88
Reputation: 36
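Those are internal (relative) links of the page. In this case you have to filter out the javascript: links (and other unwanted targets such as images) and add the base URL of the page you fetched as a prefix. Don't use $_SERVER['HTTP_REFERER'] for the prefix: it holds the page that linked to your script, not the page you downloaded.
...
// $baseUrl is an assumed variable holding the address passed to
// file_get_contents(), e.g. 'http://mysmallwebpage.com/'
foreach ($links as $link)
{
    $returnLink = $link->getAttribute('href');
    // Skip javascript: pseudo-links (add other unwanted schemes here)
    if (stripos($returnLink, 'javascript:') === 0)
    {
        continue;
    }
    // Prefix links that are not already absolute with the base URL
    if (stripos($returnLink, 'http://') !== 0 && stripos($returnLink, 'https://') !== 0)
    {
        $returnLink = rtrim($baseUrl, '/') . '/' . ltrim($returnLink, '/');
    }
    echo "<br />++", $returnLink;
}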
There are many more cases to check, but I think this gives you an idea.
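A few of those extra cases (root-relative paths, query-only links, protocol-relative links, non-web schemes) could be handled along the following lines. This is only a rough sketch: toAbsoluteUrl() is a made-up helper name, it assumes PHP 7.1 or newer, and it does not normalise "../" segments.

```php
<?php
// Hypothetical helper: turn an href into an absolute URL, or return null
// for links that are not web addresses (javascript:, mailto:, "#anchor").
function toAbsoluteUrl(string $href, string $pageUrl): ?string
{
    $href = trim($href);
    if ($href === '' || $href[0] === '#') {
        return null;
    }
    // The href already has a scheme: keep http(s), drop everything else.
    if (preg_match('~^[a-z][a-z0-9+.-]*:~i', $href)) {
        return preg_match('~^https?://~i', $href) ? $href : null;
    }
    $parts = parse_url($pageUrl);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return null; // the page URL itself is not usable as a base
    }
    $origin = $parts['scheme'] . '://' . $parts['host']
            . (isset($parts['port']) ? ':' . $parts['port'] : '');
    // Protocol-relative link such as //cdn.example.com/x
    if (strpos($href, '//') === 0) {
        return $parts['scheme'] . ':' . $href;
    }
    // Root-relative link such as /nmsd-gallery/
    if ($href[0] === '/') {
        return $origin . $href;
    }
    $path = $parts['path'] ?? '/';
    // Query-only link such as ?currentPage=3
    if ($href[0] === '?') {
        return $origin . $path . $href;
    }
    // Anything else is relative to the directory of the current page.
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $href;
}
```

Inside the loop you would then call toAbsoluteUrl($link->getAttribute('href'), 'http://mysmallwebpage.com/') and skip the link whenever the result is null.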
Upvotes: 1
Reputation: 36
Try extracting the URLs with PHP's DOM extension (DOMDocument):
$html = file_get_contents($aktPage);
$dom = new DOMDocument;
$dom->loadHTML($html);
$links = $dom->getElementsByTagName('a');
$returnLinks = array(); // initialise so the array exists even if no links are found
foreach ($links as $link)
{
    $returnLinks[] = $link->getAttribute('href');
}
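Real-world pages are often not valid HTML, and loadHTML() then emits a warning for every problem it finds. A possible way to wrap the snippet above so that those warnings are collected quietly and duplicate or empty hrefs are dropped is sketched below. extractHrefs() and the $sample markup are invented for illustration, and the sketch assumes PHP 7 or newer with the DOM extension enabled.

```php
<?php
// Hypothetical helper: collect the unique, non-empty href values of all <a> tags.
function extractHrefs(string $html): array
{
    // Collect libxml parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the earlier setting

    $hrefs = [];
    foreach ($dom->getElementsByTagName('a') as $link) {
        // getAttribute() returns '' when the tag has no href at all.
        $href = trim($link->getAttribute('href'));
        if ($href !== '') {
            $hrefs[] = $href;
        }
    }
    // Drop duplicates and re-index the result from 0.
    return array_values(array_unique($hrefs));
}

// Made-up sample markup, used only to exercise the helper.
$sample = '<html><body><a href="/a">A</a><a href="/a">again</a>'
        . '<a name="x">no href</a><a href="http://example.com/">E</a></body></html>';
print_r(extractHrefs($sample));
```

The returned values are still whatever the page author wrote in the href attribute, so relative links need to be combined with the page address afterwards.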
Upvotes: 1