user1213488

Reputation: 503

Python BeautifulSoup - Looping through multiple pages

I'm trying to first grab all the links from a page, then get the URL for the "next" button and keep looping until there are no more pages. I've been trying to get a nested loop going to achieve that, but for some reason BeautifulSoup never parses the second page; it only parses the first one and then stops.

It's hard to explain, but here is the code; it should be easier to understand what I'm trying to do :)

# This site holds the first page that it should start looping on; from this page I want to reach page 2, 3, etc.
webpage = urlopen('www.first-page-with-urls-and-next-button.com').read()

soup = BeautifulSoup(webpage)

for tag in soup.findAll('a', { "class" : "next" }):

    print tag['href']
    print "\n--------------------\n"

    # The next button is a relative URL, so append it to main-url.com
    soup = BeautifulSoup('http://www.main-url.com/' + re.sub(r'\s', '', tag['href']))

    # For some reason this variable only holds the tag['href']
    print soup

    for taggen in soup.findAll('a', { "class" : "homepage target-blank" }):
        print taggen['href']

        # Read the page found
        sidan = urlopen(taggen['href']).read()

        # Get the title
        Titeln = re.findall(patFinderTitle, sidan)

        print Titeln

Any ideas? Sorry for the poor English, I hope I won't get hammered :) Please ask if I have explained it too poorly and I will do my best to explain some more. Oh, and I am new to Python, as of today (as you might have figured :)

Upvotes: 2

Views: 5345

Answers (2)

James Thiele

Reputation: 413

For the line:

soup = BeautifulSoup('http://www.main-url.com/'+ re.sub(r'\s', '', tag['href']))

try:

webpage = urlopen('http://www.main-url.com/'+re.sub(r'\s','',tag['href'])).read()

soup = BeautifulSoup(webpage)
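
BeautifulSoup parses whatever markup you hand it, so if you pass the URL string itself, the "document" it parses is just that string; that's why your soup only contained the href. Fetching the page with urlopen first and then parsing the returned HTML fixes it.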

Upvotes: 0

larsks

Reputation: 311298

If you call urlopen on the new URL and pass the resulting file object to BeautifulSoup, I think you'll be all set. That is:

webpage = urlopen('http://www.main-url.com/' + re.sub(r'\s', '', tag['href']))
soup = BeautifulSoup(webpage)
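
Putting it all together, a minimal sketch of the full loop might look something like this (assuming the same imports as in the question, and that patFinderTitle and the placeholder URLs are defined elsewhere):

# Minimal sketch: fetch each page first, parse the HTML, then follow the
# "next" link until there isn't one. Assumes urlopen, re and BeautifulSoup
# are already imported as in the question, and that patFinderTitle and the
# placeholder URLs exist.
url = 'http://www.main-url.com/first-page-with-urls-and-next-button'
while url:
    webpage = urlopen(url).read()
    soup = BeautifulSoup(webpage)

    # Collect the article links on the current page
    for taggen in soup.findAll('a', { "class" : "homepage target-blank" }):
        sidan = urlopen(taggen['href']).read()
        Titeln = re.findall(patFinderTitle, sidan)
        print Titeln

    # Follow the "next" button, if there is one
    nexttag = soup.find('a', { "class" : "next" })
    if nexttag:
        url = 'http://www.main-url.com/' + re.sub(r'\s', '', nexttag['href'])
    else:
        url = None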

Upvotes: 2
