Syed BilawalHassan

Reputation: 135

How to scrape a whole website using BeautifulSoup

I am trying to collect all the unique URLs of a website by calling the all_pages function recursively, but the function does not return all of the site's URLs.

All I want is to get every unique URL on the website using BeautifulSoup. My code looks like this:

base_url = "http://www.readings.com.pk/"
unique_urls=[]

def all_pages(base_url,unique_urls=[]):

    response = requests.get(base_url)
    soup = BeautifulSoup(response.content, "html.parser")

    for link in soup.find_all("a"):
        url = link["href"]
        absolute_url = urljoin(base_url, url)
        if absolute_url not in unique_urls:

            if base_url in absolute_url:

                unique_urls.append(absolute_url)
                print (absolute_url)

                all_pages(absolute_url,unique_urls,book_urls)





all_pages(base_url,unique_urls)

Upvotes: 2

Views: 1763

Answers (1)

B.Adler

Reputation: 1539

Use response.text instead of response.content: response.text is the decoded string, while response.content is the raw bytes.
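
For illustration, a quick sketch of the difference using the requests API (the URL is the one from the question):

import requests

response = requests.get("http://www.readings.com.pk/")
print(type(response.content))  # <class 'bytes'>: the raw, undecoded body
print(type(response.text))     # <class 'str'>: decoded using the detected encoding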

Also, you need to return the collected URLs at some point. And instead of making unique_urls a list, make it a set, so the entries are always unique.
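
A minimal sketch of why a set helps here:

unique_urls = set()
unique_urls.add("http://www.readings.com.pk/")
unique_urls.add("http://www.readings.com.pk/")  # adding a duplicate is a no-op
print(len(unique_urls))  # prints 1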

Additionally, your method is recursive, and Python enforces a maximum recursion depth, so crawling a large site this way will eventually raise a RecursionError.
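
You can check the limit yourself (a minimal sketch; the value shown assumes a stock CPython build):

import sys

print(sys.getrecursionlimit())  # typically 1000 on a default CPython install

An iterative crawl that keeps an explicit frontier of unvisited URLs sidesteps the limit entirely, so maybe you should instead do this: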

base_url = "http://www.readings.com.pk/"

def all_pages(base_url):

    response = requests.get(base_url)
    unique_urls = {base_url}
    visited_urls = set()
    while len(unique_urls) > len(visited_urls)
        soup = BeautifulSoup(response.text, "html.parser")

        for link in soup.find_all("a"):
            try:
                url = link["href"]
            except:
                continue
            absolute_url = base_url + url
            unique_urls.add(absolute_url)

        unvisited_url = (unique_urls - visited_urls).pop()
        visited_urls.add(unvisited_url)
        response = requests.get(unvisited_url)

    return unique_urls

all_pages(base_url)

Upvotes: 3
