Reputation: 147
I'm trying to write a program in PyCharm (I haven't used this IDE before, but I don't think that's the issue). I'm having trouble collecting the data from a certain class, "react-contextmenu-wrapper".
I want to write a script that parses a web page on Spotify and creates a CSV file with the data. I've been following a tutorial on YouTube, but I must have gone wrong somewhere.
Here is my code:
from urllib.request import urlopen as Req
from bs4 import BeautifulSoup as soup
my_url = "https://open.spotify.com/genre/NMF-PopularNewReleases"
#grabs the contents of the page
Client = Req(my_url)
#reads the contents of the page
html = Client.read()
#closes the connection
Client.close()
page = soup(html, "html.parser")
#grabs each playlist
playlists = page.findAll("div",{"class":"react-contextmenu-wrapper"})
print(len(playlists))
But the script prints 0, so the list is empty. I know the class exists; I can see it when I inspect the element on the page.
Upvotes: 2
Views: 296
Reputation: 10799
The fact that the element's class is named react-contextmenu-wrapper
is a big hint. React is a JavaScript library for creating user interfaces.
There are different ways in which a webpage can be populated with elements. If you're lucky, the server will send you an .html file with all the content baked in, which is trivial to scrape and parse with BeautifulSoup.
A lot of modern websites, however, populate the page's elements dynamically, using JavaScript for example. When you view a page like this in your browser, that's no problem, but if you try to scrape it you'll end up with just the "bare bones" template .html file, in which the DOM hasn't been fully populated. This is why BeautifulSoup can't see this element.
The point is: Just because you can see an element in your browser, doesn't mean BeautifulSoup will be able to see it.
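You can see this for yourself with a quick substring check on the raw response. A minimal sketch, where the served_html string is a stand-in for the bytes Client.read() returns from a JavaScript-rendered page:

```python
# JavaScript-heavy sites often serve only a minimal "shell" document.
# served_html stands in for the bytes Client.read() would return.
served_html = b'<html><body><div id="root"></div><script src="/app.js"></script></body></html>'

# If the class name isn't in the raw response, no parser will find it,
# because the element only exists after JavaScript runs in a browser.
print(b"react-contextmenu-wrapper" in served_html)  # False
```

If this prints False for the real response, no amount of fiddling with findAll will help; the data was never in the HTML to begin with.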
Whenever you want to scrape a webpage, your first choice weapon should not be BeautifulSoup, in my opinion - you should use BeautifulSoup as a last resort. The first thing you should always do is check to see if the page you're scraping makes any requests to an API, the response of which it uses to populate itself. If it does, you can simply imitate a request to the same API, and get all the content you could ever want to scrape. As if that weren't good enough already, these APIs typically serve JSON, which is trivial to parse.
Here's how you do it:
1. Press the F12 key to access the Chrome Developer Tools and open the "Network" tab (other modern browsers should have a similar feature).
2. Press Ctrl + E, or click on the round record button in the top-left, to enable logging (the button turns red when activated).
3. Press Ctrl + R to refresh the page and start logging traffic.
4. Time to write a simple script using requests. You'll want to copy the "Request URL", and make sure to copy a few "Request Headers" which look like they might be important for the request as well:
import requests

# Request URL copied from the "Network" tab of the Developer Tools
url = "https://api.spotify.com/v1/views/NMF-PopularNewReleases?timestamp=2020-03-22T14%3A26%3A36.760Z&platform=web&content_limit=10&limit=20&types=album%2Cplaylist%2Cartist%2Cshow%2Cstation&image_style=gradient_overlay&country=us&market=us&locale=en"

# The Authorization header (bearer token) copied from the same request
headers = {
    "Authorization": "Bearer BQDJeAA33JWAC_pQVUxoPpama63RFFIsovMjNOjq_odaPx9EfyMz1Bo494Xv4a20H9gM7Hu0OYZrO3QWs2E"
}

response = requests.get(url, headers=headers)
response.raise_for_status()

data = response.json()
for item in data["content"]["items"]:
    print(item["name"])
When I run this script, I get the following output:
New Music Friday
After Hours
Colores
kelsea
Actions
Kid Krow
Walk Em Down (feat. Roddy Ricch)
Studio It’s Hits
Intentions (Acoustic)
Creep
Is Everybody Going Crazy?
2 Seater (feat. G-Eazy & Offset)
Roses/Lotus/Violet/Iris
Spotify Singles
Between Us (feat. Snoh Aalegra)
E.P.
Black Men Don’t Cheat (feat. Ari Lennox, 6LACK & Tink)
Think About You
Pray 4 Love
Acrobats
Feel free to inspect the JSON object returned by the API - you can do this in Python or just view the JSON's content by switching from the "Headers" tab to the "Preview" tab in the Chrome Developer Tools. In my example I just pulled the titles of the songs on the front page, but the JSON object contains a bunch more interesting stuff. You can also play around with the parameters in the URL's query string to grab more than 20 songs, etc.
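As one sketch of that last point, the standard library's urllib.parse makes it easy to rewrite the query string, e.g. to bump limit from 20 to 50. Whether the API actually honours a higher limit is an assumption you'd need to verify against the endpoint itself:

```python
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

url = ("https://api.spotify.com/v1/views/NMF-PopularNewReleases"
       "?timestamp=2020-03-22T14%3A26%3A36.760Z&platform=web&content_limit=10"
       "&limit=20&types=album%2Cplaylist%2Cartist%2Cshow%2Cstation"
       "&image_style=gradient_overlay&country=us&market=us&locale=en")

# Parse the query string into a dict of lists, tweak it, and rebuild the URL.
parts = urlparse(url)
params = parse_qs(parts.query)
params["limit"] = ["50"]  # ask for 50 items instead of 20 (assumed to be honoured)
new_url = urlunparse(parts._replace(query=urlencode(params, doseq=True)))

print(new_url)
```

You'd then pass new_url to requests.get along with the same headers as before.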
Upvotes: 5