Edward Lin

Reputation: 609

Use BeautifulSoup to loop through and retrieve specific URLs

I want to use BeautifulSoup to repeatedly retrieve the URL at a specific position. You may imagine that there are 4 different URL lists, each containing 100 different URL links.

I always need to get and print the 3rd URL on every list, and the previously retrieved URL (e.g. the 3rd URL on the first list) leads to the 2nd list (where I then need to get and print the 3rd URL, and so on, until the 4th retrieval).

Yet my loop only produces the first result (the 3rd URL on list 1), and I don't know how to feed the new URL back into the while loop to continue the process.

Here is my code:

import urllib.request
import json
import ssl
from bs4 import BeautifulSoup


num=int(input('enter count times: ' ))
position=int(input('enter position: ' ))

url='https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html'
print (url)

count=0
order=0
while count<num:
    context = ssl._create_unverified_context()
    htm=urllib.request.urlopen(url, context=context).read()
    soup=BeautifulSoup(htm)
    for i in soup.find_all('a'):
        order+=1
        if order ==position:
            x=i.get('href')
            print (x)
    count+=1
    url=x        
print ('done')

Upvotes: 1

Views: 962

Answers (2)

Tales Pádua

Reputation: 1461

This is a good problem for recursion. Try calling a recursive function to do this:

import requests
from bs4 import BeautifulSoup


def retrieve_urls_recur(url, position, index, deepness):
    # Stop once the link has been followed `deepness` times.
    if index >= deepness:
        return True
    else:
        plain_text = requests.get(url).text
        soup = BeautifulSoup(plain_text)
        links = soup.find_all('a')
        desired_link = links[position].get('href')
        print(desired_link)
        # Follow the retrieved link and repeat at the next depth.
        return retrieve_urls_recur(desired_link, position, index + 1, deepness)

and then call it with the desired parameters, in your case:

retrieve_urls_recur(url, 2, 0, 4)

2 is the URL's index in the list of links (the index is 0-based, so 2 is the 3rd link), 0 is the starting counter, and 4 is how deep you want to go recursively.

ps: I am using requests instead of urllib, and I didn't test this, although I recently used a very similar function with success
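
If it helps, here is one possible way to wire this up to the prompts from the question (untested, and assuming requests and bs4 are installed); since the list index is 0-based, the question's 1-based position needs 1 subtracted:

num = int(input('enter count times: '))
position = int(input('enter position: '))

# The question counts positions from 1 (the 3rd link), so subtract 1 for indexing.
url = 'https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html'
retrieve_urls_recur(url, position - 1, 0, num)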

Upvotes: 1

alecxe

Reputation: 473763

Just get the link from find_all() by index:

while count < num:
    context = ssl._create_unverified_context()
    htm = urllib.request.urlopen(url, context=context).read()

    soup = BeautifulSoup(htm)
    url = soup.find_all('a')[position].get('href')

    count += 1
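
For reference, a self-contained version of that loop (untested) that also prints each URL, as the question asks; the input prompts and the print are additions to the snippet above:

import ssl
import urllib.request
from bs4 import BeautifulSoup

num = int(input('enter count times: '))
position = int(input('enter position: '))

url = 'https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html'
print(url)

count = 0
while count < num:
    context = ssl._create_unverified_context()
    htm = urllib.request.urlopen(url, context=context).read()

    soup = BeautifulSoup(htm)
    # Index directly into the list of anchors; position is 0-based here,
    # so enter 2 to get the 3rd link on each page.
    url = soup.find_all('a')[position].get('href')
    print(url)

    count += 1

print('done')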

Upvotes: 0
