mtf-Whitney

Reputation: 55

Can a href be hidden from a scrape using beautifulsoup?

I see this in the browser inspector for the website:

<a data-bind="attr: { 'href': bandURL }, text: artist, click: playMe"  
class="item-artist" href="https://bogseyandtheargonauts.bandcamp.com?
from=discover-top">Bogsey</a>

When I scrape the page, I only get this:

<a class="item-artist" data-bind="attr: { 'href': bandURL }, text: 
 artist, click: playMe"></a>

I am trying to find the link, but for some reason the href is missing. Is there a way to hide a link from a scrape, or am I not using the proper tools? I know the code to extract the href, but right now I'm simply trying to return the whole tag with the href value:

import requests
from bs4 import BeautifulSoup


class BandCamp:
    def Search(self):
        page = requests.get("https://bandcamp.com/?g=punk&s=top&p=0&gn=0&f=all&t=folk-punk")
        data = page.content
        soup = BeautifulSoup(data, 'lxml')
        # Loop over every discover item on the page
        for top in soup.find_all('div', {'class': 'col col-3-12 discover-item'}):
            link = top.find('a')
            # Print the whole item div, which contains the link
            print(top)

bc = BandCamp()
bc.Search()

Upvotes: 2

Views: 1006

Answers (1)

alecxe

Reputation: 474001

The data you are looking for is actually in the HTML response, but it is stored in the data-blob attribute of the element with id="pagedata". In the browser, JavaScript processes this data to populate the page. requests, however, is not a browser and does not execute JavaScript, so it only downloads the initial "unrendered" page.

Here is how you can locate the element with the page data and load it into a Python dictionary:

import json
from pprint import pprint

from bs4 import BeautifulSoup
import requests


page = requests.get("https://bandcamp.com/?g=punk&s=top&p=0&gn=0&f=all&t=folk-punk")
data = page.content
soup = BeautifulSoup(data, 'lxml')

page_data = soup.find(id="pagedata")["data-blob"]
page_data = json.loads(page_data)

pprint(page_data)
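
Once the blob is parsed, the band links still have to be found inside it. The exact layout of the dictionary is not shown here, so the following is only a rough sketch that avoids guessing the real key names: it walks any parsed JSON structure and collects every string containing "bandcamp.com". The helper collect_urls and the keys in sample ("discover", "items", "band_url") are hypothetical and used purely for illustration; in practice you would pass page_data instead of sample.

```python
def collect_urls(node, needle="bandcamp.com"):
    """Recursively gather string values containing `needle` from parsed JSON."""
    found = []
    if isinstance(node, dict):
        # Descend into every value of the dictionary
        for value in node.values():
            found.extend(collect_urls(value, needle))
    elif isinstance(node, list):
        # Descend into every element of the list
        for item in node:
            found.extend(collect_urls(item, needle))
    elif isinstance(node, str) and needle in node:
        found.append(node)
    return found


# Hypothetical stand-in for page_data; the real blob's keys may differ.
sample = {
    "discover": {
        "items": [
            {"artist": "Bogsey",
             "band_url": "https://bogseyandtheargonauts.bandcamp.com"},
            {"artist": "Other", "band_url": None},
        ]
    }
}

urls = collect_urls(sample)
print(urls)
```

After inspecting the pprint output and learning the actual keys, indexing into the dictionary directly would probably be cleaner than this generic search.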

Upvotes: 1
