Reputation: 33
I'm new to software development, and I'm not sure how to go about this. I want to visit every page of a website and grab a specific bit of data from each one. My problem is that I don't know how to iterate through all of the existing pages without knowing the individual URLs ahead of time. For example, I want to visit every page whose URL starts with
"http://stackoverflow.com/questions/"
Is there a way to compile a list and then iterate through that, or is it possible to do this without creating a giant list of urls?
Upvotes: 0
Views: 12795
Reputation: 414179
To grab a specific piece of data from a website, you could use a web scraping tool such as Scrapy.
If the required data is generated by JavaScript, you might need a browser-like tool such as Selenium WebDriver and have to implement the crawling of links by hand.
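Crawling links by hand comes down to a queue of URLs still to visit plus a set of URLs already seen, so you never need the full list up front. Here is a minimal sketch using only the standard library. The names `LinkParser`, `fetch`, and `crawl`, and the `max_pages` cap, are my own and not part of any tool mentioned above. There is no error handling or rate limiting, and it only sees links present in the raw HTML, not ones added by JavaScript.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download a page and return its HTML as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, prefix, fetch_page=fetch, max_pages=100):
    """Yield (url, html) for pages reachable from start_url whose URL starts with prefix."""
    queue = deque([start_url])   # URLs discovered but not yet visited
    seen = {start_url}           # URLs already queued, to avoid revisiting
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        html = fetch_page(url)
        visited += 1
        yield url, html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any "#fragment" part.
            link = urldefrag(urljoin(url, href)).url
            if link.startswith(prefix) and link not in seen:
                seen.add(link)
                queue.append(link)
```

You would call it as `for url, html in crawl(start, "http://stackoverflow.com/questions/"): ...` and extract your data from `html` inside the loop. Because `crawl` is a generator, pages are handled one at a time as they are discovered.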
Upvotes: 0
Reputation: 298166
Try Scrapy.
It handles all of the crawling for you and lets you focus on processing the data, not extracting it. Instead of copy-pasting the code already in the tutorial, I'll leave it to you to read it.
Upvotes: 5
Reputation: 658
For example, you can make a simple for loop, like this:
    def webIterate():
        base_link = "http://stackoverflow.com/questions/"
        # Build one URL per question number from 0 to 23.
        for i in range(24):
            print("%s%d" % (base_link, i))

    webIterate()
The output will be:
http://stackoverflow.com/questions/0
http://stackoverflow.com/questions/1
http://stackoverflow.com/questions/2
...
http://stackoverflow.com/questions/23
It's just an example. You can pass in the question numbers you need and do whatever you want with them.
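If you want to pass in the numbers without building a big list of URLs first, a generator is one way to do it. This is only a sketch: `question_urls` is a name I made up, and not every number necessarily corresponds to an existing page, so whatever you do with each URL should be ready for missing pages.

```python
def question_urls(ids, base_link="http://stackoverflow.com/questions/"):
    """Lazily yield one question URL per id, without building a list first."""
    for question_id in ids:
        yield "%s%d" % (base_link, question_id)


# Each URL is produced only when the loop asks for it.
for url in question_urls(range(1, 4)):
    print(url)
```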
Upvotes: -2