brawlins4

Reputation: 322

No Data in JSON Array - BeautifulSoup and Python 3

The script below keeps writing an empty array when I try to save the scraped contents to a JSON file. No errors are raised when the script is run, and nothing is printed in the terminal either. I have similar scripts for other websites that work perfectly. Here is my code. Thanks in advance.

from urllib.request import urlopen
from bs4 import BeautifulSoup as soup
import json

openstax = 'https://cnx.org'

#opening up connection and grabbing page
uClient = urlopen(openstax)
page_html = uClient.read()
uClient.close()

#html parsing
page_soup = soup(page_html, "html.parser")

#grabs info for each textbook
containers = page_soup.findAll("div",{"class":"book"})

data = []
for container in containers:
    item = {}
    item['type'] = "Textbook"
    item['title'] = container.h3.a.text
    data.append(item)
    print(item['title'])

with open("./json/openstax.json", "w") as writeJSON:
    json.dump(data, writeJSON, ensure_ascii=False)

Upvotes: 0

Views: 234

Answers (2)

SIM

Reputation: 22440

The content of that page is generated dynamically, so you can't grab it the way you tried above. You need to either use a browser simulator or request the URL used in the snippet below, which returns the data directly. The latter is much more efficient and easier to deal with. Give it a go.

import requests

r = requests.get('https://archive.cnx.org/extras')
for item in r.json()['featuredLinks']:
    print(item['title'])

Result:

Applied Probability
Understanding Basic Music Theory
Programming Fundamentals - A Modular Structured Approach using C++
Advanced Algebra II: Conceptual Explanations
Flowering Light: Kabbalistic Mysticism and the Art of Elliot R. Wolfson
Hearing Harmony: What is Harmony?
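
To get back to the JSON file the question was after, the titles from that endpoint can be put into the same list-of-dicts shape. This is only a rough sketch: build_records, save_records and fetch_and_save are hypothetical helper names, the "Textbook" label is carried over from the question rather than read from the data, and it assumes each entry under 'featuredLinks' has a 'title' key, as the snippet above suggests. It uses urllib from the standard library instead of requests, only to keep the example dependency-free.

```python
import json
import os
from urllib.request import urlopen

# URL taken from the answer above; the shape of its payload is an assumption.
EXTRAS_URL = "https://archive.cnx.org/extras"


def build_records(payload):
    """Turn the decoded JSON payload into the question's list-of-dicts shape."""
    # Assumes every entry under 'featuredLinks' carries a 'title' key.
    return [
        {"type": "Textbook", "title": link["title"]}
        for link in payload.get("featuredLinks", [])
    ]


def save_records(records, path):
    """Write the records as JSON, creating the target directory if needed."""
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(records, fh, ensure_ascii=False)


def fetch_and_save(path="./json/openstax.json"):
    """Hypothetical helper: download the payload and save the records."""
    with urlopen(EXTRAS_URL) as response:
        payload = json.load(response)
    save_records(build_records(payload), path)
```

Calling fetch_and_save() performs the network request and writes the file. The os.makedirs call is there because open() fails if the ./json directory does not exist yet.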

Upvotes: 0

dethos

Reputation: 3454

The page you are fetching (defined in the openstax variable) is generated on the client side using JavaScript, so the final HTML isn't present in the response to the request your code makes.

Because of this, page_soup.findAll("div",{"class":"book"}) doesn't return any elements, which in turn explains why the JSON file contains an empty array.
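
To see the effect without hitting the network, here is a small sketch. STATIC_HTML and RENDERED_HTML are made-up stand-ins (not the real cnx.org markup) for what the server sends and for what the browser builds after running the scripts. extract_titles is a hypothetical helper that repeats the lookup from the question.

```python
from bs4 import BeautifulSoup

# Made-up example of a server response for a client-side rendered page:
# an empty mount point plus a script tag, and no book markup at all.
STATIC_HTML = """
<html>
  <body>
    <div id="main"></div>
    <script src="app.js"></script>
  </body>
</html>
"""

# Made-up example of the DOM after JavaScript has run in a browser.
RENDERED_HTML = """
<html>
  <body>
    <div id="main">
      <div class="book"><h3><a href="#">Sample Book</a></h3></div>
    </div>
  </body>
</html>
"""


def extract_titles(html):
    """Same lookup as in the question, returning the list of titles found."""
    soup = BeautifulSoup(html, "html.parser")
    containers = soup.find_all("div", {"class": "book"})
    return [container.h3.a.text for container in containers]


print(extract_titles(STATIC_HTML))    # no div.book in the raw response
print(extract_titles(RENDERED_HTML))  # the markup only exists after rendering
```

With an empty containers list the for loop in the question never runs, so nothing is printed and json.dump writes [].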

As stated in the noscript element of the HTML returned for that page, you should try the http://legacy.cnx.org/content URL if you don't want to use the JavaScript-rendered webpage.
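
A quick way to check for such a hint yourself is to print the text of any noscript elements in the response. The sketch below runs on a made-up SAMPLE_HTML string (the wording of the message is an assumption, not the real page content), and noscript_messages is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Made-up response body; the real page's noscript wording may differ.
SAMPLE_HTML = """
<html>
  <body>
    <noscript>JavaScript is disabled. Try http://legacy.cnx.org/content instead.</noscript>
    <div id="main"></div>
  </body>
</html>
"""


def noscript_messages(html):
    """Return the text of every <noscript> element in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(" ", strip=True) for tag in soup.find_all("noscript")]


for message in noscript_messages(SAMPLE_HTML):
    print(message)
```

In the script from the question, passing page_html to the same helper would show whatever fallback message the site actually sends.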

Upvotes: 1
