haha

Reputation: 19

Scraping infinite scrolling pages with load more button using Selenium

I was trying to get all project titles and creator names by web scraping, and most of it is working, but I got a "TimeoutException: Message:" when trying to scrape an infinite-scrolling page with a "Load more" button. Please let me know what is wrong and what I need to correct. Thanks.

Below is the code currently being used:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import pandas as pd

driver = webdriver.Chrome('/usr/local/bin/chromedriver')
driver.get("https://www.kickstarter.com/discover/advanced?sort=newest&seed=2695789&page=1/")

# The original selector passed bare class names separated by spaces, which a
# CSS selector reads as a chain of (non-existent) tag names, so the wait
# times out. Each class must be prefixed with a dot and joined without spaces.
button = WebDriverWait(driver, 10).until(EC.element_to_be_clickable(
    (By.CSS_SELECTOR,
     '.bttn.keyboard-focusable.bttn-medium.bttn-primary.theme--create.fill-bttn-icon.hover-fill-bttn-icon')))
button.click()

names = []
creators = []
soup = BeautifulSoup(driver.page_source, 'html.parser')
for a in soup.find_all('div', {'class': 'js-react-proj-card grid-col-12 grid-col-6-sm grid-col-4-lg'}):
    name = a.find('div', attrs={'class': 'clamp-5 navy-500 mb3 hover-target'})
    creator = a.find('div', attrs={'class': 'type-13 flex'})
    names.append(name.h3.text)
    creators.append(creator.text)

df = pd.DataFrame({'Name': names, 'Creator': creators})
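As a sanity check for the parsing step on its own (no Selenium, no network), the same BeautifulSoup logic can be run against a small inline HTML snippet. The markup below is a hand-made stand-in for one Kickstarter project card, assuming the class names from the code above; it is not the real page:

```python
from bs4 import BeautifulSoup

# Hand-made stand-in for a single project card (assumed structure).
html = '''
<div class="js-react-proj-card grid-col-12 grid-col-6-sm grid-col-4-lg">
  <div class="clamp-5 navy-500 mb3 hover-target"><h3>Sample Project</h3></div>
  <div class="type-13 flex">by Sample Creator</div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
names, creators = [], []
for card in soup.find_all('div', {'class': 'js-react-proj-card grid-col-12 grid-col-6-sm grid-col-4-lg'}):
    names.append(card.find('div', {'class': 'clamp-5 navy-500 mb3 hover-target'}).h3.text)
    creators.append(card.find('div', {'class': 'type-13 flex'}).text)

print(names, creators)
```

Note that when `find_all` is given a class string containing spaces, BeautifulSoup matches it against the tag's full `class` attribute value, so the order of the class names must match the page's markup exactly.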

Upvotes: 0

Views: 708

Answers (1)

Dev

Reputation: 2813

You don't need BeautifulSoup and Selenium for this. Go for the requests library instead: the page serves JSON, so it's easy to grab it all hassle-free.

import requests

# Pages are 1-based; the original loop started at 0 and reused `i` for both
# the page number and the project index.
for page in range(1, 6):
    req = requests.get(
        'https://www.kickstarter.com/discover/advanced?google_chrome_workaround&woe_id=0&sort=newest&seed=2695910&page=' + str(page),
        headers={'Accept': 'application/json',
                 'Content-Type': 'application/json'})
    if req.status_code == 200:
        data = req.json()
        projects = data.get("projects")
        for project in projects:
            print("Project Name - " + project['name'], end='          Created By - ')
            print(project['creator'].get('name'))

        print("----------------")

Output:

(screenshot of the printed project names and creators omitted)

You can scroll down the page and count how many times the "Load more" button has to load more content; put that count in the for loop and you will get all the content.
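Instead of hard-coding a page count, the loop can stop on its own when a page comes back empty. This is a sketch under the assumptions taken from the answer above: the same endpoint URL, and a JSON payload with a top-level "projects" list whose items carry a 'name' and a nested 'creator' dict. The helper names (`extract_projects`, `scrape_all`) are made up for illustration:

```python
import requests


def extract_projects(payload):
    """Pull (name, creator) pairs out of one page of the JSON payload.

    The payload shape is assumed from the answer above; a missing or
    empty "projects" key yields an empty list.
    """
    return [(p['name'], p['creator'].get('name'))
            for p in payload.get('projects', [])]


def scrape_all(max_pages=50):
    """Request consecutive pages until one comes back empty or non-200."""
    url = ('https://www.kickstarter.com/discover/advanced'
           '?google_chrome_workaround&woe_id=0&sort=newest&seed=2695910&page={}')
    records = []
    for page in range(1, max_pages + 1):
        req = requests.get(url.format(page),
                           headers={'Accept': 'application/json',
                                    'Content-Type': 'application/json'})
        if req.status_code != 200:
            break
        rows = extract_projects(req.json())
        if not rows:  # no more projects: we've run past the last page
            break
        records.extend(rows)
    return records
```

The `max_pages` cap is just a safety limit so a misbehaving endpoint can't loop forever.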

Upvotes: 1
