t0mas

Reputation: 147

Issue in Web Scraping

I'm trying to write a program in PyCharm (I haven't used this IDE before, but I don't think that's the issue). I'm having trouble collecting the data from a certain class, "react-contextmenu-wrapper".

I want to write a script which parses a web page on Spotify and creates a CSV file with the data. I've been following a tutorial on YouTube, but I must have gone wrong somewhere.

Here is my code:

from urllib.request import urlopen as Req
from bs4 import BeautifulSoup as soup

my_url = "https://open.spotify.com/genre/NMF-PopularNewReleases"

#grabs the contents of the page
Client = Req(my_url)

#reads the contents of the page
html = Client.read()

#closes the connection
Client.close()

page = soup(html, "html.parser")

#grabs each playlist
playlists = page.find_all("div", {"class": "react-contextmenu-wrapper"})

print(len(playlists))

But the script prints 0, meaning it found no matching elements. I know the class exists - I can see it when I inspect the element in my browser.

Upvotes: 2

Views: 296

Answers (1)

Paul M.

Reputation: 10799

The fact that the element's class is named react-contextmenu-wrapper is a big hint. React is a JavaScript library for creating user interfaces.

There are different ways in which a webpage can be populated with elements. If you're lucky, the server will send you an .html file with all the content baked in - this is trivial to scrape and parse with BeautifulSoup.

A lot of modern websites, however, populate a webpage's elements dynamically, using JavaScript for example. When you view a page like this in your browser, it's no problem - but if you try to scrape a page like this you'll end up just getting the "bare bones" template .html, where the DOM hasn't been populated fully. This is why BeautifulSoup can't see this element.

The point is: Just because you can see an element in your browser, doesn't mean BeautifulSoup will be able to see it.

Whenever you want to scrape a webpage, BeautifulSoup should not be your first-choice weapon, in my opinion - use it as a last resort. The first thing you should always do is check whether the page you're scraping makes any requests to an API, the response of which it uses to populate itself. If it does, you can simply imitate a request to the same API and get all the content you could ever want to scrape. As if that weren't good enough already, these APIs typically serve JSON, which is trivial to parse.
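To see why JSON is so convenient to work with - here's a minimal sketch using a made-up payload (the real Spotify response has different field names, which we'll get to below):

```python
import json

# Made-up payload mimicking the general shape of an API response.
payload = '{"items": [{"name": "Album A"}, {"name": "Album B"}]}'

data = json.loads(payload)  # one call turns JSON into ordinary dicts/lists
names = [item["name"] for item in data["items"]]
print(names)  # ['Album A', 'Album B']
```

No HTML tree to walk, no CSS classes to hunt for - you index straight into the structure.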

Here's how you do it:

  1. Open the webpage in your browser.
  2. If you're using Google Chrome, hit the F12 key to access the Developer Tools (other modern browsers should have a similar feature.)
  3. Open the "Network" tab. We will use Chrome's network logger to log all requests made by the webpage.
  4. Click on the funnel-shaped icon (which turns red when activated) to enable filtering.
  5. Click on XHR (this means view only XMLHttpRequests. These are the kind of requests we're interested in, because we want to see potential outgoing requests to APIs - we are not interested in capturing requests made to resources like images, etc.)
  6. Press Ctrl + E or click on the round record button in the top-left to enable logging (this one also turns red when activated.)
  7. Press Ctrl + R to refresh the page and start logging traffic.
  8. You should see the list of requests growing as the page loads after refreshing. It will look something like this (sorry for the large image):
  9. In my case, Spotify made nine XHR requests (on the left) during the time in which I logged traffic. I clicked on a few of them and inspected their "Headers" tab until I found one which had a "Request URL" that looked like it was talking to an API ("api" was part of the URL in this case.)

Time to write a simple script using requests. You'll want to copy the "Request URL", and make sure to copy a few "Request Headers" which look like they might be important for the request as well:

import requests

url = "https://api.spotify.com/v1/views/NMF-PopularNewReleases?timestamp=2020-03-22T14%3A26%3A36.760Z&platform=web&content_limit=10&limit=20&types=album%2Cplaylist%2Cartist%2Cshow%2Cstation&image_style=gradient_overlay&country=us&market=us&locale=en"

headers = {
    "Authorization": "Bearer BQDJeAA33JWAC_pQVUxoPpama63RFFIsovMjNOjq_odaPx9EfyMz1Bo494Xv4a20H9gM7Hu0OYZrO3QWs2E"
}

response = requests.get(url, headers=headers)
response.raise_for_status()

data = response.json()

for item in data["content"]["items"]:
    print(item["name"])

When I run this script, I get the following output:

New Music Friday
After Hours
Colores
kelsea
Actions
Kid Krow
Walk Em Down (feat. Roddy Ricch)
Studio It’s Hits
Intentions (Acoustic)
Creep
Is Everybody Going Crazy?
2 Seater (feat. G-Eazy & Offset)
Roses/Lotus/Violet/Iris
Spotify Singles
Between Us (feat. Snoh Aalegra)
E.P.
Black Men Don’t Cheat (feat. Ari Lennox, 6LACK & Tink)
Think About You
Pray 4 Love
Acrobats

Feel free to inspect the JSON object returned by the API - you can do this in Python or just view the JSON's content by switching from the "Headers" tab to the "Preview" tab in the Chrome Developer Tools. In my example I just pulled the titles of the songs on the front page, but the JSON object contains a bunch more interesting stuff. You can also play around with the parameters in the URL's query string to grab more than 20 songs, etc.
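For instance, you could bump the limit parameter before making the request. Here's a sketch using the standard library's urllib.parse, on a shortened version of the URL (whether the API actually honors larger limits is something you'd have to test):

```python
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

# Shortened version of the captured request URL.
url = "https://api.spotify.com/v1/views/NMF-PopularNewReleases?limit=20&country=us"

parts = urlparse(url)
params = parse_qs(parts.query)   # {'limit': ['20'], 'country': ['us']}
params["limit"] = ["50"]         # ask for 50 items instead of 20

new_url = urlunparse(parts._replace(query=urlencode(params, doseq=True)))
print(new_url)
# https://api.spotify.com/v1/views/NMF-PopularNewReleases?limit=50&country=us
```

Editing the query string programmatically like this beats string-slicing the URL by hand, especially once there are a dozen parameters.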

Upvotes: 5
