Reputation: 322
The script below keeps producing an empty array when I write its results to a JSON file. No errors come up when the script is run, and nothing is printed in the terminal either. I have similar scripts for other websites that work perfectly. Here is my code. Thanks in advance.
from urllib.request import urlopen
from bs4 import BeautifulSoup as soup
import json
openstax = 'https://cnx.org'
#opening up connection and grabbing page
uClient = urlopen(openstax)
page_html = uClient.read()
uClient.close()
#html parsing
page_soup = soup(page_html, "html.parser")
#grabs info for each textbook
containers = page_soup.findAll("div",{"class":"book"})
data = []
for container in containers:
    item = {}
    item['type'] = "Textbook"
    item['title'] = container.h3.a.text
    data.append(item)
    print(item['title'])
with open("./json/openstax.json", "w") as writeJSON:
    json.dump(data, writeJSON, ensure_ascii=False)
Upvotes: 0
Views: 234
Reputation: 22440
The content of that page is generated dynamically, so you can't grab it the way you tried above. You need to either use a browser simulator or request the URL below, which returns the data as JSON. The latter is much more efficient and easier to deal with. Give it a go.
import requests
r = requests.get('https://archive.cnx.org/extras')
for item in r.json()['featuredLinks']:
    print(item['title'])
Result:
Applied Probability
Understanding Basic Music Theory
Programming Fundamentals - A Modular Structured Approach using C++
Advanced Algebra II: Conceptual Explanations
Flowering Light: Kabbalistic Mysticism and the Art of Elliot R. Wolfson
Hearing Harmony: What is Harmony?
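To tie this back to the question, here is a minimal sketch that turns that JSON response into the same list of records the original script writes to disk. It assumes the response still contains a featuredLinks list whose entries carry a title key, as in the snippet above. The helper names build_records, fetch_payload and save_records are made up for illustration, and the standard library's urlopen is used here in place of requests.

```python
import json
from urllib.request import urlopen

EXTRAS_URL = "https://archive.cnx.org/extras"  # URL taken from the answer above


def build_records(payload):
    """Turn the decoded JSON payload into the records the question's script writes."""
    records = []
    # Assumption: 'featuredLinks' is a list of dicts that each carry a 'title'.
    for link in payload.get("featuredLinks", []):
        records.append({"type": "Textbook", "title": link["title"]})
    return records


def fetch_payload(url=EXTRAS_URL):
    """Download and decode the JSON document (needs network access)."""
    with urlopen(url) as response:
        return json.load(response)


def save_records(records, path):
    """Write the records as JSON, keeping non-ASCII titles readable."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(records, handle, ensure_ascii=False)
```

Calling save_records(build_records(fetch_payload()), "./json/openstax.json") should then produce the file from the question, provided the ./json directory exists and the endpoint is still reachable.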
Upvotes: 0
Reputation: 3454
The page you are fetching (defined in the openstax variable) is generated on the client side using JavaScript, so the final HTML isn't present in the response to the request your code makes.
Because of this, page_soup.findAll("div",{"class":"book"}) doesn't return any elements, which in turn explains why the JSON file contains an empty array.
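The effect can be reproduced offline. The sketch below uses the standard library's html.parser in place of BeautifulSoup, so it runs without extra packages, and counts div elements with the class book in two made-up HTML strings: one that mimics an empty JavaScript shell and one that mimics the rendered page. Neither string is the real cnx.org markup, and BookDivCounter and count_book_divs are hypothetical names.

```python
from html.parser import HTMLParser


class BookDivCounter(HTMLParser):
    """Counts <div> tags whose class list contains 'book'."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # The class attribute may hold several space-separated names, or be missing.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "book" in classes:
            self.count += 1


def count_book_divs(html):
    parser = BookDivCounter()
    parser.feed(html)
    parser.close()
    return parser.count


# Made-up stand-ins, not the real cnx.org markup.
JS_SHELL = '<html><body><div id="app"></div><script src="app.js"></script></body></html>'
RENDERED = '<html><body><div class="book"><h3><a>Some title</a></h3></div></body></html>'

print(count_book_divs(JS_SHELL))  # 0: the shell has no book containers yet
print(count_book_divs(RENDERED))  # 1: present only after the script has run
```

With zero matches the for loop in the question never executes, so data stays an empty list and json.dump writes [].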
As stated in the noscript element of the HTML returned for that page, you should try the http://legacy.cnx.org/content URL if you don't want to use the JavaScript-rendered webpage.
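One way to spot such a hint yourself is to pull the text out of any noscript element in the response. The sketch below does that with the standard library's html.parser. The SAMPLE string is an invented stand-in rather than the actual cnx.org response, and NoscriptText and noscript_text are hypothetical names.

```python
from html.parser import HTMLParser


class NoscriptText(HTMLParser):
    """Collects the text found inside <noscript> elements."""

    def __init__(self):
        super().__init__()
        self._depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "noscript":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "noscript" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that sits inside a <noscript> element.
        if self._depth and data.strip():
            self.chunks.append(data.strip())


def noscript_text(html):
    parser = NoscriptText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Invented example; the real page's wording will differ.
SAMPLE = (
    '<html><body><noscript>Please enable JavaScript or visit '
    '<a href="http://legacy.cnx.org/content">the legacy site</a>.</noscript>'
    '<div id="app"></div></body></html>'
)

print(noscript_text(SAMPLE))
```

Feeding it the page_html from the question (decoded to a string) would show whatever fallback message the site provides for clients without JavaScript.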
Upvotes: 1