Kamikaze_goldfish

Reputation: 861

While loop with beautiful soup and python

Alright, now I am really stumped. I am scraping data with Beautiful Soup, and the pages follow a structured format: the links run from https://www.brightscope.com/ratings/a through "other", and each letter (a, b, c, ...) after /ratings/ has multiple pages. I am trying to create a while loop that goes to each page and, while a certain condition holds, scrapes all the hrefs (I haven't written that part yet). However, when I run the code, the while loop keeps running non-stop. How can I fix it so it goes to each page, checks for the condition, and moves on to the next letter when the condition isn't found? Before anyone asks: I have looked at the pages it keeps requesting and don't see any li tags, yet it continues to run.

For instance, https://www.brightscope.com/ratings/A/18 is the highest page for the A's, but the loop keeps running past it.

import requests
from bs4 import BeautifulSoup

url = "https://www.brightscope.com/ratings/"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
hrefs = []
ratings = []
ks = []
pages_scrape = []

for href in soup.findAll('a'):
    if 'href' in href.attrs:
        hrefs.append(href.attrs['href'])
for good_ratings in hrefs:
    if good_ratings.startswith('/ratings/'):
        ratings.append(url[:-9]+good_ratings)

del ratings[0]
del ratings[27:]
count = 1
# So it runs each letter a, b, c, ... 
for each_rating in ratings:
    #Pulls the page
    page = requests.get(each_rating)
    #Does its soup thing
    soup = BeautifulSoup(page.text, 'html.parser')
    #Supposed to stay in A, B, C,... until it can't find the 'li' tag
    while soup.find('li'):
        page = requests.get(each_rating+str(count))
        print(page.url)
        count = count+1
        #Keeps running this and never breaks
    else:
        count = 1
        break

Upvotes: 0

Views: 1840

Answers (2)

leotrubach

Reputation: 1597

BeautifulSoup's find() method returns only the first matching element. That means if you need to go through all the <li> elements, you need to use the findAll() method and iterate over its result.
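For example, a rough sketch (the URL here is just a placeholder for whichever ratings page you are on):

import requests
from bs4 import BeautifulSoup

# Hypothetical example page -- substitute the page you are actually scraping
page = requests.get("https://www.brightscope.com/ratings/A/1")
soup = BeautifulSoup(page.text, 'html.parser')

# findAll() returns a list of every matching tag, so you can loop over all of them
for li in soup.findAll('li'):
    print(li.get_text(strip=True))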

Upvotes: 1

Deejpake

Reputation: 456

The result of soup.find('li') never changes, because all you do inside the while loop is update the page and count variables. You need to rebuild the soup from the new page variable; then the condition will change. Maybe something like this:

while soup.find('li'):
    page = requests.get(each_rating + str(count))
    # Rebuild the soup from the newly fetched page so the loop condition is re-checked
    soup = BeautifulSoup(page.text, 'html.parser')
    print(page.url)
    count = count + 1
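For what it's worth, folding that into your outer for loop might look something like the sketch below. I'm guessing at the intent here: count is reset for each letter instead of break-ing out of the for loop, and your each_rating + str(count) URL pattern is kept as-is.

for each_rating in ratings:
    count = 1
    # Start from the letter's landing page
    page = requests.get(each_rating)
    soup = BeautifulSoup(page.text, 'html.parser')
    # Keep paging while the current page still contains an <li> tag
    while soup.find('li'):
        page = requests.get(each_rating + str(count))
        # Re-make the soup from the fresh page so the condition can eventually fail
        soup = BeautifulSoup(page.text, 'html.parser')
        print(page.url)
        count = count + 1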

Hope this helps

Upvotes: 0
