Reputation: 19
I am trying to get all project titles and creator names by web scraping, and most of it works, but I get a "TimeoutException: Message:" when scraping an infinite-scrolling page that uses a "Load more" button. Please let me know what is wrong and what I need to correct. Thanks.
Below is the code currently being used:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import pandas as pd

driver = webdriver.Chrome('/usr/local/bin/chromedriver')
driver.get("https://www.kickstarter.com/discover/advanced?sort=newest&seed=2695789&page=1/")

# wait for the "Load more" button, then click it
button = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, 'bttn keyboard-focusable bttn-medium bttn-primary theme--create fill-bttn-icon hover-fill-bttn-icon')))
button.click()

names = []
creators = []
soup = BeautifulSoup(driver.page_source, 'html.parser')
for a in soup.findAll('div', {'class': 'js-react-proj-card grid-col-12 grid-col-6-sm grid-col-4-lg'}):
    name = a.find('div', attrs={'class': 'clamp-5 navy-500 mb3 hover-target'})
    creator = a.find('div', attrs={'class': 'type-13 flex'})
    names.append(name.h3.text)
    creators.append(creator.text)

df = pd.DataFrame({'Name': names, 'Creator': creators})
Upvotes: 0
Views: 708
Reputation: 2813
You really don't need BeautifulSoup and Selenium for this. Go for the requests library instead; it's easy to grab it all, hassle-free.
import requests

records = []
for page in range(5):
    # the discover endpoint returns JSON when the Accept header asks for it
    req = requests.get('https://www.kickstarter.com/discover/advanced?google_chrome_workaround&woe_id=0&sort=newest&seed=2695910&page=' + str(page),
                       headers={'Accept': 'application/json',
                                'Content-Type': 'application/json'})
    if req.status_code == 200:
        data = req.json()
        projects = data.get("projects")
        for project in projects:
            print("Project Name - " + project['name'], end=' Created By - ')
            print(project['creator'].get('name'))
            print("----------------")
You can scroll down the page and count how many times the "Load more" button loads more content; put that count in the for loop and you will get all the content.
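If you would rather not hard-code the number of pages, here is a minimal sketch (assuming the endpoint simply returns an empty "projects" list once you page past the end) that keeps requesting pages until nothing comes back, and collects the names and creators into a DataFrame as the original code intended:

import requests
import pandas as pd

names, creators = [], []
page = 1
while True:
    resp = requests.get('https://www.kickstarter.com/discover/advanced?google_chrome_workaround&woe_id=0&sort=newest&seed=2695910&page=' + str(page),
                        headers={'Accept': 'application/json',
                                 'Content-Type': 'application/json'})
    if resp.status_code != 200:
        break
    # assumption: an exhausted listing comes back with an empty "projects" list
    projects = resp.json().get('projects') or []
    if not projects:
        break
    for project in projects:
        names.append(project['name'])
        creators.append(project['creator'].get('name'))
    page += 1

df = pd.DataFrame({'Name': names, 'Creator': creators})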
Upvotes: 1