Ethan

Reputation: 321

How to make a crawler to scrape a website with bs4

I wrote a script to scrape quotes and the names of their authors. In this project I use requests to fetch the page's HTML and bs4 to parse it. I use a while loop to follow the pagination link to the next pages, but I want the code to stop once there are no pages left. My code works, but it won't stop running.

Here is my code:

from bs4 import BeautifulSoup as bs
import requests

def scrape():
    page = 1
    url = 'http://quotes.toscrape.com'
    r = requests.get(url)
    soup = bs(r.text,'html.parser')
    quotes = soup.find_all('span',attrs={"class":"text"})
    authors = soup.find_all('small',attrs={"class":"author"})
    p_link = soup.find('a',text="Next")

    condition = True
    while condition:
        with open('quotes.txt','a') as f:
            for i in range(len(authors)):
                f.write(quotes[i].text+' '+authors[i].text+'\n')
        if p_link not in soup:
            condition = False
            page += 1
            url = 'http://quotes.toscrape.com/page/{}'.format(page)
            r = requests.get(url)
            soup = bs(r.text,'html.parser')
            quotes = soup.find_all('span',attrs={"class":"text"})
            authors = soup.find_all('small',attrs={"class":"author"})
            condition = True
        else:
            condition = False

    print('done')


scrape()

Upvotes: 0

Views: 1447

Answers (1)

user8073978

Reputation:

Because p_link is never in soup. I see two reasons for this:

  1. You search for it using the text 'Next', but the actual link text seems to be 'Next' plus whitespace plus a right arrow (see the sketch after this list).

  2. The tag contains an 'href' attribute which points to the next page, so its value is different on each page.
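
For reference, here is a minimal inspection sketch (assuming the li with class "next" wrapping the link, as on quotes.toscrape.com) that prints what the pagination link actually contains:

from bs4 import BeautifulSoup as bs
import requests

# Minimal sketch: inspect what the pagination link really looks like.
r = requests.get('http://quotes.toscrape.com')
soup = bs(r.text, 'html.parser')

next_li = soup.find('li', attrs={"class": "next"})
print(repr(next_li.a.get_text()))   # something like 'Next →', not exactly 'Next'
print(next_li.a['href'])            # something like '/page/2/', different on every page
print(soup.find('a', text="Next"))  # None, so "p_link not in soup" is always True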

Also, there is no point in setting condition to False inside the first if block of the while loop. You set it back to True at the end of that block anyway.

So...

Instead of searching by the text 'Next', use:

soup.find('li',attrs={"class":"next"})

And for the stop condition, use:

if soup.find('li',attrs={"class":"next"}) is None:
   condition = False
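
Alternatively, you could follow the href of that link instead of building the URL from a page counter. A rough sketch, assuming the same li/a markup (the function name is just illustrative):

from urllib.parse import urljoin

from bs4 import BeautifulSoup as bs
import requests

def scrape_following_links():
    # Illustrative variant: follow the 'Next' link's href instead of a page counter.
    url = 'http://quotes.toscrape.com'
    while url:
        r = requests.get(url)
        soup = bs(r.text, 'html.parser')
        # ... extract and write quotes/authors here, as above ...
        next_li = soup.find('li', attrs={"class": "next"})
        # urljoin resolves a relative href like '/page/2/' against the current URL
        url = urljoin(url, next_li.a['href']) if next_li else None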

Finally, if you want to write the quotes from the last page too, I suggest you move the 'writing to file' part to the end of the loop body. Or restructure the whole thing like this:

from bs4 import BeautifulSoup as bs
import requests

def scrape():
    page = 1
    while True:

        if page == 1:
            url = 'http://quotes.toscrape.com'
        else:
            url = 'http://quotes.toscrape.com/page/{}'.format(page)

        r = requests.get(url)
        soup = bs(r.text,'html.parser')

        quotes = soup.find_all('span',attrs={"class":"text"})
        authors = soup.find_all('small',attrs={"class":"author"})

        # open with UTF-8 so the curly quote characters can be written as text
        with open('quotes.txt','a',encoding='utf-8') as f:
            for i in range(len(authors)):
                f.write(quotes[i].text+' '+authors[i].text+'\n')

        # stop when the page has no 'Next' button, i.e. this was the last page
        if soup.find('li',attrs={"class":"next"}) is None:
            break

        page+=1

    print('done')


scrape()

Upvotes: 2
