Reputation: 3975
Search engine bots crawl the web and download each page they go to for analysis, right?
How exactly do they download a page? What way do they store the pages?
I am asking because I want to run an analysis on a few webpages. I could scrape the page by going to the address but wouldn't it make more sense to download the pages to my computer and work on them from there?
Upvotes: 3
Views: 337
Reputation: 6762
Try HTTrack
About the way they do it:
The indexing starts from a designated starting point (an entrance if you prefer). From there, the spider follows recursively all hyperlinks until a given depth.
Search engine spiders work like this as well, but there are many crawling simultaneously and there are other factors that count. For example a newly created post here in SO will be picked up by google very fast, but an update at a low traffic web site will be picked up even days later.
Upvotes: 7
Reputation: 3767
You can use the debugging tools built into Firefox (or firebug) and Chrome to examine how the page works. As far as downloading them directly, I am not sure. You could maybe try viewing the page source in your browser, and then copy and paste the code.
Upvotes: 2