Reputation: 5
I'm writing a Python script to list the top 5 featured projects on scratch.mit.edu. I fetch the page with requests. The element that holds the project titles is inside a div tag, but when I parse the response with bs4, that div shows no children or descendants. How can I look inside the tag?
I've tried find_all(), find(), .descendants, and .children.
soup.find("div").children
I expected the output to contain <div id="page">.
Upvotes: 0
Views: 468
Reputation: 84465
API
Use the API the page itself calls to update its content, and parse the JSON response:
https://api.scratch.mit.edu/proxy/featured
import requests
import pandas as pd
# Fetch the featured data as JSON from the endpoint the page itself uses
r = requests.get('https://api.scratch.mit.edu/proxy/featured').json()
# Build (title, link) pairs for the first 5 community featured projects
project_info = [(item['title'], 'https://scratch.mit.edu/projects/' + str(item['id'])) for item in r['community_featured_projects'][:5]]
df = pd.DataFrame(project_info, columns=['Title', 'Link'])
print(df.head())
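If pandas is not needed, the same extraction can be sketched as a small helper. This is only an illustration: the names top_featured, fetch_featured and sample are hypothetical, and the sample payload is made-up data that mirrors only the keys used above (community_featured_projects, title, id), not a real API response.

```python
PROJECT_URL = "https://scratch.mit.edu/projects/{}"
FEATURED_URL = "https://api.scratch.mit.edu/proxy/featured"


def top_featured(payload, n=5):
    """Return up to n (title, link) pairs from a parsed featured response."""
    # Fall back to an empty list if the expected key is missing
    projects = payload.get("community_featured_projects", [])
    return [(p["title"], PROJECT_URL.format(p["id"])) for p in projects[:n]]


def fetch_featured(timeout=10):
    """Download and decode the featured JSON (needs network access)."""
    # Imported here so top_featured can be used without requests installed
    import requests

    response = requests.get(FEATURED_URL, timeout=timeout)
    response.raise_for_status()  # raise on HTTP errors instead of parsing an error body
    return response.json()


# Hypothetical sample payload, for illustration only
sample = {
    "community_featured_projects": [
        {"id": i, "title": f"Project {i}"} for i in range(1, 8)
    ]
}

for title, link in top_featured(sample):
    print(title, link)
```

Against the live endpoint, the call would be top_featured(fetch_featured()). Keeping the parsing separate from the download makes the slicing logic easy to check without network access.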
Selenium
Or, as a sub-optimal choice since the content is dynamically rendered, you could use a browser automation tool like Selenium.
Restrict the match to the first "box", select the child a tags of the elements with class thumbnail-title, then index into the resulting list for the top 5 (or use df.head()):
.box:nth-of-type(1) .thumbnail-title > a
Python (as noted by @P.hunter, you could run this headless):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
import pandas as pd
options = Options()
options.add_argument("--headless")
d = webdriver.Chrome(options = options)
d.get('https://scratch.mit.edu/')
# Wait up to 10 seconds for the title links in the first box to be present
links = WebDriverWait(d, 10).until(
    EC.presence_of_all_elements_located(
        (By.CSS_SELECTOR, ".box:nth-of-type(1) .thumbnail-title > a")
    )
)
project_info = [(item.get_attribute('title'), item.get_attribute('href')) for item in links]
df = pd.DataFrame(project_info , columns = ['Title', 'Link'])
d.quit()
print(df)
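As background to the point that the content is dynamically rendered, the sketch below shows what bs4 sees when the server sends only an empty container that JavaScript fills in later. The HTML string is a hypothetical stand-in, not the actual Scratch markup, and the id "app" and script path are invented for illustration. It also shows that .children is an iterator, so it has to be materialised (for example with list()) before it can be inspected.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a page whose content is filled in by JavaScript
static_html = """
<html><body>
<div id="app"></div>
<script src="/js/bundle.js"></script>
</body></html>
"""

soup = BeautifulSoup(static_html, "html.parser")
first_div = soup.find("div")

# .children is an iterator; list() turns it into something printable
children = list(first_div.children)
print(first_div.get("id"), children)
```

requests only downloads this initial HTML and never runs the script, so the div stays empty no matter which of find_all(), find(), .descendants or .children is used.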
Upvotes: 2