Reputation: 35
It seems I can scrape any tag and class on this page except h3. It keeps returning None
or an empty list. I'm trying to get this h3 tag:
...on the following webpage:
https://www.empireonline.com/movies/features/best-movies-2/
And this is the code I use:
import requests
from pprint import pprint
from bs4 import BeautifulSoup

URL = "https://www.empireonline.com/movies/features/best-movies-2/"
response = requests.get(URL)
web_html = response.text
soup = BeautifulSoup(web_html, "html.parser")
movies = soup.findAll(name="h3", class_="jsx-4245974604")
movies_text = []
for item in movies:
    result = item.getText()
    movies_text.append(result)
print(movies_text)
Can you please help with the solution for this problem?
Upvotes: 1
Views: 418
Reputation: 99
The h3 elements on this page are rendered by JavaScript, so to scrape them you need a tool that drives a real browser, such as Selenium WebDriver.
First install Selenium and download the WebDriver for your browser. For Chrome, the driver is available here:
https://chromedriver.chromium.org/downloads
I have just tested the code below:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.empireonline.com/movies/features/best-movies-2/")
# Collect the text of every h3 element on the rendered page
sel_element = driver.find_elements(By.TAG_NAME, "h3")
new_list = []
for element in sel_element:
    text = element.text
    new_list.append(text)
driver.quit()  # close the browser once the text has been collected

# Write in reverse order, because the first element on the page is number 100
with open("100_movies.txt", mode="a", encoding="utf-8") as file:
    for item in new_list[::-1]:
        file.write(f"{item}\n")
Upvotes: 0
Reputation: 116
As other people mentioned, this is dynamic content that is only generated when the page runs in a browser. That is why you can't find the class "jsx-4245974604" with BS4.
If you print your "soup" variable, you can see that the class is not there. But if you only want the names of the movies, you can use another part of the HTML in this case.
The movie name is in the alt attribute of the image (and also in several other parts of the HTML).
import requests
from pprint import pprint
from bs4 import BeautifulSoup

URL = "https://www.empireonline.com/movies/features/best-movies-2/"
response = requests.get(URL)
web_html = response.text
soup = BeautifulSoup(web_html, "html.parser")
movies = soup.findAll("img", class_="jsx-952983560")
movies_text = []
for item in movies:
    result = item.get('alt')
    movies_text.append(result)
print(movies_text)
If you run into this issue in the future, print the initial HTML that soup gives you and check by eye whether the information you need is in it.
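If you would rather not check by eye, the same test can be done in code. Below is a minimal sketch that uses only the standard library. The helper names (TagProbe, probe_static_html) and the sample HTML string are made up for illustration and are not taken from the real page. In practice you would pass response.text in place of sample_html.

```python
from html.parser import HTMLParser


class TagProbe(HTMLParser):
    """Collects the attributes of every start tag with a given name."""

    def __init__(self, tag_name):
        super().__init__()
        self.tag_name = tag_name
        self.found = []  # one attribute dict per matching start tag

    def handle_starttag(self, tag, attrs):
        if tag == self.tag_name:
            self.found.append(dict(attrs))


def probe_static_html(html, tag_name):
    """Return the attribute dicts of every `tag_name` element in `html`."""
    probe = TagProbe(tag_name)
    probe.feed(html)
    return probe.found


# Hypothetical stand-in for response.text: the h3 is missing because a
# script would add it later, but the img with its alt text is present.
sample_html = (
    '<div id="root"><img class="jsx-1" alt="Some Movie"></div>'
    "<script>render()</script>"
)

h3_tags = probe_static_html(sample_html, "h3")
img_tags = probe_static_html(sample_html, "img")

print(len(h3_tags))                          # how many h3 tags the raw HTML has
print([img.get("alt") for img in img_tags])  # alt texts available without a browser
```

If the count for the tag you want is zero, requests alone will not be enough for that tag, and you either need another part of the HTML or a browser-driven tool.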
Upvotes: 1