Reputation: 321
I wrote a script to scrape quotes and author names. In this project I use requests to get each page's HTML and bs4 to parse it. I use a while loop to follow the pagination link to the next page, and I want my code to stop running when there is no page left. My code works, but it won't stop running.
Here is my code:
from bs4 import BeautifulSoup as bs
import requests

def scrape():
    page = 1
    url = 'http://quotes.toscrape.com'
    r = requests.get(url)
    soup = bs(r.text, 'html.parser')
    quotes = soup.find_all('span', attrs={"class": "text"})
    authors = soup.find_all('small', attrs={"class": "author"})
    p_link = soup.find('a', text="Next")
    condition = True
    while condition:
        with open('quotes.txt', 'a') as f:
            for i in range(len(authors)):
                f.write(quotes[i].text + ' ' + authors[i].text + '\n')
        if p_link not in soup:
            condition = False
            page += 1
            url = 'http://quotes.toscrape.com/page/{}'.format(page)
            r = requests.get(url)
            soup = bs(r.text, 'html.parser')
            quotes = soup.find_all('span', attrs={"class": "text"})
            authors = soup.find_all('small', attrs={"class": "author"})
            condition = True
        else:
            condition = False
    print('done')

scrape()
Upvotes: 0
Views: 1447
Reputation:
Because p_link is never in soup. There are two reasons for this:
1. You search for it using the text 'Next', but the actual link text appears to be 'Next' plus whitespace and a right-arrow character, so the find returns None.
2. The tag's 'href' attribute points to the next page, so its value is different for each page; a tag found on one page will never match a later page's soup.
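You can see the first problem in isolation (a minimal sketch; the HTML string below only approximates the site's pagination markup):

from bs4 import BeautifulSoup as bs

# Approximation of the site's pagination markup; the real page may differ slightly
html = '<li class="next"><a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a></li>'
soup = bs(html, 'html.parser')

print(soup.find('a', text="Next"))               # None: the <a> text is not exactly "Next"
print(soup.find('li', attrs={"class": "next"}))  # matches the pagination <li>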
Also, setting condition to False at the start of the first if block makes no difference: you set it back to True at the end of the same block anyway.
So, instead of searching by the text 'Next', use:
soup.find('li', attrs={"class": "next"})
And for the condition, use:
if soup.find('li', attrs={"class": "next"}) is None:
    condition = False
Finally, if you want to write the quotes from the last page too, I suggest you move the 'writing to file' part to the end of each iteration, or drop the condition flag altogether, like this:
from bs4 import BeautifulSoup as bs
import requests

def scrape():
    page = 1
    while True:
        if page == 1:
            url = 'http://quotes.toscrape.com'
        else:
            url = 'http://quotes.toscrape.com/page/{}'.format(page)
        r = requests.get(url)
        soup = bs(r.text, 'html.parser')
        quotes = soup.find_all('span', attrs={"class": "text"})
        authors = soup.find_all('small', attrs={"class": "author"})
        # Write the current page before checking for a next page,
        # so the last page's quotes are not skipped; open with UTF-8
        # so the curly quote characters write cleanly
        with open('quotes.txt', 'a', encoding='utf-8') as f:
            for i in range(len(authors)):
                f.write(quotes[i].text + ' ' + authors[i].text + '\n')
        # Stop when the pagination block has no "next" entry
        if soup.find('li', attrs={"class": "next"}) is None:
            break
        page += 1
    print('done')

scrape()
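As a side note, since the next link's href already points to the following page (point 2 above), you could also follow it directly instead of building the page URL by hand. A minimal sketch of that idea (the function name scrape_by_href is just for illustration):

from urllib.parse import urljoin
from bs4 import BeautifulSoup as bs
import requests

def scrape_by_href():
    url = 'http://quotes.toscrape.com'
    with open('quotes.txt', 'a', encoding='utf-8') as f:
        while url:
            soup = bs(requests.get(url).text, 'html.parser')
            quotes = soup.find_all('span', attrs={"class": "text"})
            authors = soup.find_all('small', attrs={"class": "author"})
            for quote, author in zip(quotes, authors):
                f.write(quote.text + ' ' + author.text + '\n')
            next_li = soup.find('li', attrs={"class": "next"})
            # urljoin resolves the relative href (e.g. '/page/2/') against the current URL
            url = urljoin(url, next_li.a['href']) if next_li else None
    print('done')

scrape_by_href()

This way the loop ends naturally when no next link exists, and there is no page counter to keep in sync with the site.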
Upvotes: 2