Reputation: 13
When I try to scrape data from https://www.britannica.com/search?query=world+war+2 I can't find all the elements. I am specifically looking for everything inside the div element with the class search-feature-container (it's the content inside the info box at the top), but when I scrape it, find just returns None. This is my code:
import requests
from bs4 import BeautifulSoup

def scrape_britannica(product_name):
    ### SETUP ###
    URL_raw = 'https://www.britannica.com/search?query=' + product_name
    URL = URL_raw.strip().replace(" ", "+")
    ## gets the html from the url
    try:
        page = requests.get(URL)
    except:
        print("Could not find URL..")
    ## a way to come around scrape blocking
    soup = BeautifulSoup(page.content, 'html.parser')
    parent = soup.find("div", {"class": "search-feature-container"})
    print(parent)

scrape_britannica('carl barks')
I guess it has something to do with the content not being loaded yet when you first open the page, but I still don't know how to fix it. Or maybe it's because the website uses cookies. I'm literally looking for all the ideas I can get! Thx :D
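A quick way to test that guess is to check whether the class name appears anywhere in the raw HTML that requests gets back. This is just a diagnostic sketch; if it prints False, the info box is most likely built by JavaScript after the page loads:

import requests

# diagnostic sketch: does the class name appear anywhere in the raw HTML?
html = requests.get('https://www.britannica.com/search?query=carl+barks').text
print('search-feature-container' in html)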
Upvotes: 0
Views: 251
Reputation: 11515
You are dealing with a website that runs JavaScript to render its data once the page loads. You can use the following approach, which reads the script source of the page that contains the part you are looking for. That gives you the data as a dict, so you can do whatever you want with it.
import requests
from bs4 import BeautifulSoup
import json

r = requests.get("https://www.britannica.com/search?query=world+war+2")
soup = BeautifulSoup(r.text, 'html.parser')

# the data for the info box is embedded as JSON inside one of the script tags
script = soup.findAll(
    "script", {'type': 'text/javascript'})[15].get_text(strip=True)

# cut out the JSON object between the first "{" and the last "}"
start = script.find("{")
end = script.rfind("}") + 1
data = script[start:end]

n = json.loads(data)
print(json.dumps(n, indent=4))

# print(n.keys())
# print(n["topicInfo"]["description"])
Output:
{
    "toc": [
        {
            "id": 1,
            "title": "Introduction",
            "url": "/event/World-War-II"
        },
        {
            "id": 53531,
            "title": "Axis initiative and Allied reaction",
            "url": "/event/World-War-II#ref53531"
        },
        {
            "id": 53563,
            "title": "The Allies\u2019 first decisive successes",
            "url": "/event/World-War-II/The-Allies-first-decisive-successes"
        },
        {
            "id": 53576,
            "title": "The Allied landings in Europe and the defeat of the Axis powers",
            "url": "/event/World-War-II/The-Allied-landings-in-Europe-and-the-defeat-of-the-Axis-powers"
        }
    ],
    "topicInfo": {
        "topicId": 648813,
        "imageId": 74903,
        "imageUrl": "https://cdn.britannica.com/s:300x1000/26/188426-050-2AF26954/Germany-Poland-September-1-1939.jpg",
        "imageAltText": "World War II",
        "title": "World War II",
        "identifier": "1939\u20131945",
        "description": "World War II, conflict that involved virtually every part of the world during the years 1939\u201345. The principal belligerents were the Axis powers\u2014Germany, Italy, and Japan\u2014and the Allies\u2014France, Great Britain, the United States, the Soviet Union, and, to a lesser extent, China. The war was in many...",
        "url": "/event/World-War-II"
    }
}
Output of print(n.keys())
dict_keys(['toc', 'topicInfo'])
Output of print(n["topicInfo"]["description"])
World War II, conflict that involved virtually every part of the world during the years 1939–45. The principal belligerents were the Axis powers—Germany, Italy, and Japan—and the Allies—France, Great Britain, the United States, the Soviet Union, and, to a lesser extent, China. The war was in many...
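Note that the script index (15) is hard-coded and could shift if Britannica changes the page. A more defensive sketch, reusing soup and json from above and assuming the embedded payload is the only text/javascript script mentioning topicInfo, would locate the script by its content instead of its position:

# hedged sketch: pick the script by content rather than by a fixed index
for s in soup.findAll("script", {'type': 'text/javascript'}):
    text = s.get_text(strip=True)
    if "topicInfo" in text:  # assumption: this keyword marks the embedded payload
        n = json.loads(text[text.find("{"):text.rfind("}") + 1])
        break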
Upvotes: 1
Reputation: 12684
I would find all script tags and check whether the keyword featuredSearchTopic appears in each one. Then I convert the text to JSON (a dictionary) and access the 'description' value.
import requests
from bs4 import BeautifulSoup
import json

def scrape_britannica(product_name):
    ### SETUP ###
    URL_raw = 'https://www.britannica.com/search?query=' + product_name
    URL = URL_raw.strip().replace(" ", "+")
    ## gets the html from the url
    try:
        page = requests.get(URL)
    except:
        print("Could not find URL..")
    ## a way to come around scrape blocking
    soup = BeautifulSoup(page.content, 'html.parser')
    #print(soup)
    for parent in soup.findAll("script"):  #, {"class": "search-feature-container"})
        if 'featuredSearchTopic' in str(parent):
            # the script assigns the JSON payload to a variable, so take the
            # part after "=" and drop the trailing ";" before parsing it
            txt = json.loads(parent.text.replace(';', '').split('=')[-1])
            print(txt.get('topicInfo').get('description'))

scrape_britannica('carl barks')
Result:
comic strip: Institutionalization: …Disney artists of them all, Carl Barks, sole creator of more than 500 of the best Donald Duck and other stories, was rescued from the oblivion to which the Disney policy of anonymity would consign him to become a cult figure. His Collected Works ran to 30 luxurious folio volumes.…...
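One caveat: for queries without a featured info box there is no script containing featuredSearchTopic, so the loop simply prints nothing, and if the parsed payload ever lacks 'topicInfo', the chained .get() calls would raise an AttributeError. A slightly more defensive version of the two lines inside the if block (just a sketch):

            txt = json.loads(parent.text.replace(';', '').split('=')[-1])
            # guard against a payload without 'topicInfo' (shape may vary)
            topic = txt.get('topicInfo') or {}
            print(topic.get('description', 'No description found.'))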
Upvotes: 1