DJRodrigue

Reputation: 103

Trying to display text from a website using Beautiful Soup

I am trying to get the number of team members for each team on the list. Right now I get all the team links, but instead of all of them I want only the links to teams with at least 5 team members. How would I go about doing so? Nothing I have tried has worked so far.

    import time
    import requests
    from bs4 import BeautifulSoup


    def get_all(url, base):
        r = requests.get(url)
        page = r.text

        soup = BeautifulSoup(page, 'html.parser')

        for team_links in soup.select('div.details h3 a'):
            yield base + team_links['href']

        next_page = soup.find('div', {'class': 'pages'}).find('span', text='Next')


        while next_page:
            # Gives the server a break
            time.sleep(0.2)

            # Use the `base` argument rather than the global BASE_URL
            r = requests.get(base + next_page.find_previous('a')['href'])
            page = r.text
            # Name the parser explicitly, as in the first request
            soup = BeautifulSoup(page, 'html.parser')
            for team_links in soup.select('div.details h3 a'):
                yield base + team_links['href']
            next_page = soup.find('div', {'class': 'pages'}).find('span', text='Next')


    if __name__ == '__main__':

        BASE_URL = 'http://www.gosugamers.net'
        URL = 'http://www.gosugamers.net/counterstrike/teams'

        for link in get_all(URL, BASE_URL):
            print(link)

Upvotes: 1

Views: 59

Answers (1)

alecxe

Reputation: 473833

Locate the Members: label, which appears further down the tree after the team link. Then get the team members value, convert it to an integer, and check whether it is less than 5:

    for team_links in soup.select('div.details h3 a'):
        # The "Members:" header cell follows the team link in document order;
        # its sibling <td> holds the count.
        members = int(team_links.find_next("th", text="Members:").find_next_sibling("td").text.strip())
        if members < 5:  # skip teams with fewer than 5 members
            continue
        yield base + team_links['href']

Note that this would fail with a ValueError when the cell contains something like 1 (Pending: 1) instead of a plain integer. Depending on whether you want to count the pending team members, the handling logic would differ.

For instance, if you don't want to count pending team members, we can just split by space and get the first item, ignoring what is inside "pending":

    for team_links in soup.select('div.details h3 a'):
        members = int(team_links.find_next("th", text="Members:").find_next_sibling("td").text.strip().split()[0])
        # ...
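
If you would rather keep both behaviours behind one switch, the parsing can be pulled out into a small helper. This is only a sketch: `parse_member_count` and its `include_pending` flag are hypothetical names, the only cell formats assumed are a plain integer or the `1 (Pending: 1)` form mentioned above, and returning 0 for a cell with no leading number is my assumption, not something the site guarantees.

```python
import re


def parse_member_count(cell_text, include_pending=False):
    """Parse a members cell such as '5' or '1 (Pending: 1)' (hypothetical helper)."""
    text = cell_text.strip()

    # The confirmed member count is the leading integer in the cell.
    confirmed = re.match(r"\d+", text)
    if confirmed is None:
        # Assumption: treat a cell without a leading number as zero members.
        return 0
    total = int(confirmed.group())

    # Optionally add the pending count found inside the parentheses.
    if include_pending:
        pending = re.search(r"Pending:\s*(\d+)", text)
        if pending:
            total += int(pending.group(1))
    return total
```

Inside the loop you would then pass the text of the td cell to `parse_member_count(...)` in place of the `int(...)` call and compare the result against 5 as before.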

Upvotes: 1
