Reputation: 861
Alright, now I am really stumped. I am scraping data with Beautiful Soup, and the pages have a structured format: the links look like https://www.brightscope.com/ratings/a, and the ratings run from a through other. Each letter after /ratings/, such as a, b, c, ..., has multiple pages. I am trying to create a while loop that visits each page and, while a certain condition exists, scrapes all the hrefs (I haven't written that part of the code yet). However, when I run the code the while loop runs non-stop. How can I fix it so that it goes to each page, scrapes while the condition is met, and moves on to the next letter when it isn't? Before anyone asks: I have checked the pages and don't see any li tags while it continues to run.
For instance, https://www.brightscope.com/ratings/A/18 is the highest it goes for the A's, but it keeps running.
import requests
from bs4 import BeautifulSoup

url = "https://www.brightscope.com/ratings/"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
hrefs = []
ratings = []
ks = []
pages_scrape = []
for href in soup.findAll('a'):
    if 'href' in href.attrs:
        hrefs.append(href.attrs['href'])
for good_ratings in hrefs:
    if good_ratings.startswith('/ratings/'):
        ratings.append(url[:-9] + good_ratings)
del ratings[0]
del ratings[27:]
count = 1
# So it runs each letter a, b, c, ...
for each_rating in ratings:
    # Pulls the page
    page = requests.get(each_rating)
    # Does its soup thing
    soup = BeautifulSoup(page.text, 'html.parser')
    # Supposed to stay in A, B, C,... until it can't find the 'li' tag
    while soup.find('li'):
        page = requests.get(each_rating + str(count))
        print(page.url)
        count = count + 1
        # Keeps running this and never breaks
    else:
        count = 1
        break
Upvotes: 0
Views: 1840
Reputation: 1597
BeautifulSoup's find() method returns only the first matching element. That means, if you need to go through all <li> elements, you need to use the findAll() method and iterate over its result.
Upvotes: 1
Reputation: 456
The result of soup.find('li') never changes: all you do in the while loop is update the variables page and count. You need to make a new soup from the updated page variable inside the loop; then the condition will change. Maybe something like this:
while soup.find('li'):
    page = requests.get(each_rating + str(count))
    soup = BeautifulSoup(page.text, 'html.parser')
    print(page.url)
    count = count + 1
Hope this helps
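Putting the pieces together, note you also need to reset count for each letter. Here is a sketch of the corrected loop structure, run against static HTML instead of the live site so it is self-contained; fetch_page and fake_site are hypothetical stand-ins for requests.get(...).text and the real pages:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the site: letter 'a' has two pages with
# <li> tags, then a page without any, which should end the loop.
fake_site = {
    'a1': '<ul><li>Plan 1</li></ul>',
    'a2': '<ul><li>Plan 2</li></ul>',
}

def fetch_page(letter, count):
    # Stand-in for requests.get(each_rating + str(count)).text
    return fake_site.get(letter + str(count), '<p>No results</p>')

scraped = []
for letter in ['a']:
    count = 1  # reset the page counter for every letter
    while True:
        # Re-parse the soup on every iteration so the condition can change
        soup = BeautifulSoup(fetch_page(letter, count), 'html.parser')
        if not soup.find('li'):
            break  # no listings left: move on to the next letter
        scraped.extend(li.text for li in soup.findAll('li'))
        count = count + 1

print(scraped)  # ['Plan 1', 'Plan 2']
```

With the real site you would replace fetch_page with a requests.get call and keep the same structure: re-soup each page, and break out to the next letter when no li tags are found.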
Upvotes: 0