Reputation: 8836
I am writing PHP code that uses a regex to get all the links from a page, and I need to adapt it so it gets the links from an entire website. I guess the extracted URLs should be checked again, and so on, so that the script visits every URL of the site, not just the one given page.
I know that anything is possible, but how would I go about this? Thank you for your guidance.
Upvotes: 2
Views: 2301
Reputation: 67195
Normally, you do not have access to the underlying server in a way that would let you list all of the pages on the site.
So you just need to do what Google does: Get all links from the page and then scan those links for additional links.
Upvotes: 0
Reputation: 2160
This will get URLs from url() (CSS) and from href and src attributes (links, images, scripts):
#(?:href|src)="([^"]+)|url\(["']?(.*?)["']?\)#i
They will be captured in groups 1 and 2. Be aware that some URLs can be relative, so you have to make them absolute before requesting them.
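For example, here is a minimal sketch of how that pattern could be used with preg_match_all; the example.com URL and the merging of the two capture groups are just illustrative assumptions:

<?php
// Minimal sketch: extract URLs from a page's HTML using the pattern above.
$html = file_get_contents('http://example.com/');

$pattern = '#(?:href|src)="([^"]+)|url\(["\']?(.*?)["\']?\)#i';
preg_match_all($pattern, $html, $matches);

// Group 1 holds href/src values, group 2 holds url(...) values;
// merge them and drop the empty entries left by the non-matching branch.
$urls = array_filter(array_merge($matches[1], $matches[2]));

foreach ($urls as $url) {
    // Relative URLs still need to be resolved against the page's base URL.
    echo $url, PHP_EOL;
}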
Upvotes: 0
Reputation: 2104
Hmm, to ensure that you get all the pages that Google has found, what about crawling Google instead? Just search for "site:domain.com" and then retrieve anything that follows this pattern:
<h3 class="r"><a href="http://domain.com/.*?" class=l
(You'll have to escape the right characters as well; the '.*?' is the regex part that captures the URLs Google has found.)
Anyways, that's just a suggestion for an alternative approach.
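A rough sketch of that idea in PHP, assuming Google's result markup still matches the pattern above; the search URL and the domain.com placeholder are illustrative:

<?php
// Rough sketch: pull result URLs for "site:domain.com" out of a Google
// results page. The markup pattern is the one from this answer, escaped,
// and may not match Google's current HTML.
$html = file_get_contents('https://www.google.com/search?q=site%3Adomain.com');

$pattern = '#<h3 class="r"><a href="(http://domain\.com/.*?)" class=l#i';
preg_match_all($pattern, $html, $matches);

foreach ($matches[1] as $url) {
    echo $url, PHP_EOL;
}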
Upvotes: 2
Reputation: 732
So, your regex grabs all the links. You cycle through a loop of those links, grab each with cURL, run that through your regex, wash, rinse, repeat.
Might want to make sure to put some sort of URL depth counter in there, lest you end up parsing The Internet.
Might also want to make sure you don't re-check links you've already followed, lest you end up at the end of Infinite Recursion Street.
Might also want to look at threading, lest it take 100,000 years.
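Here is a minimal sketch of that loop, assuming a simple href-only regex, a same-host check, and an illustrative starting URL and depth limit:

<?php
// Minimal crawler sketch: fetch a page with cURL, extract links, and follow
// them breadth-first, with a visited list and a depth limit.
function fetchPage(string $url): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html === false ? '' : $html;
}

$start    = 'http://example.com/'; // illustrative starting point
$maxDepth = 2;                     // the depth counter, so we don't parse The Internet
$visited  = [];
$queue    = [[$start, 0]];

while ($queue) {
    [$url, $depth] = array_shift($queue);

    // Skip links we've already followed and anything beyond the depth limit.
    if (isset($visited[$url]) || $depth > $maxDepth) {
        continue;
    }
    $visited[$url] = true;

    $html = fetchPage($url);
    preg_match_all('#href="([^"]+)"#i', $html, $matches);

    foreach ($matches[1] as $link) {
        // Only queue absolute links on the same host; relative URLs would
        // need to be resolved against $url first.
        if (parse_url($link, PHP_URL_HOST) === parse_url($start, PHP_URL_HOST)) {
            $queue[] = [$link, $depth + 1];
        }
    }
}

print_r(array_keys($visited)); // every URL the crawler visited

Threading (e.g. via curl_multi_exec) can be layered on top once the single-threaded version works.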
Upvotes: 1