Reputation: 503
I'm trying to first grab all the links from a page, then get the URL for the "next" button and keep looping until there are no more pages. I've been trying to get a nested loop going to achieve that, but for some reason BeautifulSoup never parses the second page; it only parses the first one and then stops.
It's hard to explain, but here is the code, which should make it easier to understand what I'm trying to do :)
    # this site holds the first page that it should start looping on..
    # from this page i want to reach page 2, 3, etc.
    webpage = urlopen('www.first-page-with-urls-and-next-button.com').read()
    soup = BeautifulSoup(webpage)
    for tag in soup.findAll('a', { "class" : "next" }):
        print tag['href']
        print "\n--------------------\n"
        # next button is relative url so append it to main-url.com
        soup = BeautifulSoup('http://www.main-url.com/' + re.sub(r'\s', '', tag['href']))
        # for some reason this variable only holds the tag['href']
        print soup
        for taggen in soup.findAll('a', { "class" : "homepage target-blank" }):
            print tag['href']
            # Read page found
            sidan = urlopen(taggen['href']).read()
            # get title
            Titeln = re.findall(patFinderTitle, sidan)
            print Titeln
Any ideas? Sorry for my poor English; I hope I won't get hammered :) Please ask if I've explained it too poorly and I'll do my best to explain some more. Oh, and I'm new to Python, as of today (as you might have figured :)
Upvotes: 2
Views: 5345
Reputation: 413
You're passing a URL string directly to BeautifulSoup, which just parses that string as markup instead of fetching the page. For the line:

    soup = BeautifulSoup('http://www.main-url.com/' + re.sub(r'\s', '', tag['href']))

try:

    webpage = urlopen('http://www.main-url.com/' + re.sub(r'\s', '', tag['href'])).read()
    soup = BeautifulSoup(webpage)
Upvotes: 0
Reputation: 311298
If you call urlopen on the new URL and pass the resulting file object to BeautifulSoup, I think you'll be all set. That is:

    webpage = urlopen('http://www.main-url.com/' + re.sub(r'\s', '', tag['href']))
    soup = BeautifulSoup(webpage)
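To illustrate the overall structure the question is after (fetch a page, collect item links, follow the "next" link until there is none), here is a minimal sketch. It uses Python 3 syntax and the standard library's html.parser instead of BeautifulSoup, and the pages and class names are hypothetical stand-ins; in real code each HTML string would come from urlopen(url).read():

```python
from html.parser import HTMLParser

# Minimal link extractor on top of the stdlib parser: collects the href
# of every <a> tag whose class attribute equals the target class.
class LinkCollector(HTMLParser):
    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attrs = dict(attrs)
        if attrs.get('class') == self.target_class:
            self.links.append(attrs.get('href'))

def collect_links(html, target_class):
    parser = LinkCollector(target_class)
    parser.feed(html)
    return parser.links

# Hypothetical pages standing in for fetched HTML. In real code, replace
# the dict lookup with: urlopen(base_url + url).read().decode()
pages = {
    '/page1': '<a class="homepage" href="/item1">x</a>'
              '<a class="next" href="/page2">next</a>',
    '/page2': '<a class="homepage" href="/item2">x</a>',
}

url = '/page1'
items = []
while url:
    html = pages[url]
    items.extend(collect_links(html, 'homepage'))
    # Follow the "next" link if present; stop when there is none.
    next_links = collect_links(html, 'next')
    url = next_links[0] if next_links else None

print(items)  # ['/item1', '/item2']
```

The key point, same as both answers above: the loop always fetches and parses the *contents* of the next page rather than handing the URL string itself to the parser.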
Upvotes: 2