Tendekai Muchenje

Reputation: 563

Scraping linked pages while still scraping original page

I am scraping a page using requests and bs4. I am trying to scrape the pages it links to as well, but my script keeps scraping only the page I started on.

Specifically, I am scraping the page at https://untappd.com/v/beer-culture/893427 to get the beer names from that page. The menu section has a dropdown that links to different menus, each with the same page structure. I have been able to extract the links to those menu pages (see print(menu_urls) in the script). I then iterate through the list of links, creating a new soup for each one and scraping it, but the script scrapes the original page n times, where n is the length of the list of URLs. So in my case, instead of scraping this list:

['https://untappd.com/v/beer-culture/893427?menu_id=1489', 'https://untappd.com/v/beer-culture/893427?menu_id=116472']

it only scrapes the original https://untappd.com/v/beer-culture/893427 twice.

Here is my script:

import requests
from bs4 import BeautifulSoup

venue_url = 'https://untappd.com/v/beer-culture/893427'
count = 0

response = requests.get(venue_url, headers = {'User-agent': 'Mozilla/5.0'})
soup = BeautifulSoup(response.text, 'html.parser')

def get_menu_beers(soup):
    global count
    menu = soup.find('div', {'class': 'menu-area'})
    beers_all = menu.find_all('ul', {'class': 'menu-section-list'})
    for beer_group in beers_all:
        beers = beer_group.find_all('li')
        for beer in beers:
            details = beer.find('div', {'class': 'beer-details'})
            name_ = details.find("a",{"class":"track-click"}).text
            count = count + 1
            print(count, ' ', name_)

select_options = soup.find_all('select', {'class':'menu-selector'})
options_list = select_options[0].find_all('option')
menu_ids =[]
for option in options_list:
    menu_ids.append(int(option['value']))

menu_urls = []
for menu_id in menu_ids:
    menu_url = str(venue_url)+ '?menu_id=' + str(menu_id)
    menu_urls.append(menu_url)

print(menu_urls)

for url in menu_urls:
    res = requests.get(venue_url, headers = {'User-agent': 'Mozilla/5.0'})
    s = BeautifulSoup(res.text, 'html.parser')
    get_menu_beers(s)

Upvotes: 0

Views: 44

Answers (2)

Joslen Caven

Reputation: 1

It looks like the issue is that you're always making requests to the original venue_url instead of using the correct url from the menu_urls list. In your last loop, you're still passing venue_url to requests.get(), so it's fetching the same page multiple times instead of the linked menu pages.

Replace this line inside your loop:

res = requests.get(venue_url, headers={'User-agent': 'Mozilla/5.0'})

with:

res = requests.get(url, headers={'User-agent': 'Mozilla/5.0'})
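A minimal sketch of the corrected loop, arranged so the fix can be checked without touching the network. The helper name scrape_menus and its fetch and handle parameters are hypothetical and do not appear in the original script; they are only there so a stand-in fetcher can be swapped in for requests.get.

```python
def scrape_menus(menu_urls, fetch, handle):
    """Fetch every URL in menu_urls and pass each page body to handle.

    fetch and handle are hypothetical injection points (not in the original
    script): fetch(url) returns the page text, handle(text) processes it.
    Returns the list of URLs that were actually requested.
    """
    fetched = []
    for url in menu_urls:
        # Use the loop variable `url` here, not the outer `venue_url`.
        text = fetch(url)
        handle(text)
        fetched.append(url)
    return fetched


if __name__ == "__main__":
    menu_urls = [
        "https://untappd.com/v/beer-culture/893427?menu_id=1489",
        "https://untappd.com/v/beer-culture/893427?menu_id=116472",
    ]
    # Stand-in fetcher so the sketch runs offline; it only echoes the URL.
    seen = scrape_menus(menu_urls, fetch=lambda u: "page for " + u, handle=print)
    assert seen == menu_urls
```

In the real script, fetch would presumably wrap requests.get(url, headers={'User-agent': 'Mozilla/5.0'}).text, and handle would build the BeautifulSoup object and call get_menu_beers on it.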

Upvotes: 0

Charles Han

Reputation: 2010

In the last few lines of your code, pass each url from menu_urls to requests.get() instead of venue_url:

for url in menu_urls:
    #### pass in url not venue_url ####
    res = requests.get(url, headers = {'User-agent': 'Mozilla/5.0'})
    s = BeautifulSoup(res.text, 'html.parser')
    get_menu_beers(s)
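Since the loop now relies on every entry in menu_urls being a distinct address, it may help to build that list in one small function and check it before making any requests. The sketch below is one possible way to do that with the standard library; build_menu_urls is a hypothetical helper name, not something from the question.

```python
from urllib.parse import urlencode


def build_menu_urls(venue_url, menu_ids):
    """Return one URL per menu id, with menu_id as a query parameter.

    Hypothetical helper: assumes venue_url has no query string of its own.
    """
    return [venue_url + "?" + urlencode({"menu_id": menu_id}) for menu_id in menu_ids]


if __name__ == "__main__":
    urls = build_menu_urls("https://untappd.com/v/beer-culture/893427", [1489, 116472])
    for url in urls:
        print(url)
```

With the two menu ids from the question, this yields the same two URLs that print(menu_urls) shows there.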

Upvotes: 1
