jko0401

Reputation: 47

Scrolling with Selenium to scrape more data not getting more than 50 or 1000 elements

I want to create a list of all of the diamonds' URLs in the table on Blue Nile, which should be ~142K entries. I noticed that I had to scroll to load more entries, so my first approach was to scroll to the end of the page before scraping. However, that would only ever scrape a maximum of 1000 elements. I learned that this is due to the issues outlined in this question: Selenium find_elements_by_id() doesn't return all elements, but the solutions there aren't clear to me.

I tried to scroll the page by a certain amount and scrape until the page has reached the end. However, I can only seem to get the initial 50 unique elements.

import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.bluenile.com/diamond-search?pt=setform")
source_site = 'www.bluenile.com'
SCROLL_PAUSE_TIME = 0.5
last_height = driver.execute_script("return document.body.scrollHeight")
print(last_height)
new_height = 500
diamond_urls = []
soup = BeautifulSoup(driver.page_source, "html.parser")
count = 0

while new_height < last_height:
    for url in soup.find_all('a', class_='grid-row row TL511DiaStrikePrice', href=True):
        full_url = source_site + url['href'][1:]
        diamond_urls.append(full_url)
        count += 1
    if count == 50:
        driver.execute_script("window.scrollBy(0, 500);")
        time.sleep(SCROLL_PAUSE_TIME)
        new_height+=500
        print(new_height)
        count = 0

Please help me find the issue with my code above or suggest a better solution. Thanks!

Upvotes: 0

Views: 318

Answers (1)

cullzie

Reputation: 2755

As a simpler solution, I would just query their API (sample below):

https://www.bluenile.com/api/public/diamond-search-grid/v2?startIndex=0&pageSize=50&_=1591689344542&unlimitedPaging=false&sortDirection=asc&sortColumn=default&shape=RD&maxDateType=MANUFACTURING_REQUIRED&isQuickShip=false&hasVisualization=false&isFiltersExpanded=false&astorFilterActive=false&country=USA&language=en-us&currency=USD&productSet=BN

One of the response parameters of this endpoint is `countRaw`, which is 100876. It should therefore be simple enough to iterate over the data in blocks of 50 (or more, though you don't want to abuse the endpoint) until you have everything you need.
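A rough sketch of that paging loop, assuming the endpoint keeps accepting the `startIndex`/`pageSize` query parameters seen in the sample URL (the `results` key used below is a guess at the response schema, not documented; adjust it to whatever the JSON actually contains):

```python
import requests

API_URL = "https://www.bluenile.com/api/public/diamond-search-grid/v2"

def page_offsets(total, page_size):
    """Start indices for fetching `total` rows in blocks of `page_size`."""
    return list(range(0, total, page_size))

def fetch_all(page_size=50):
    # Parameter names copied from the sample URL above; only the paging
    # ones change between requests.
    params = {
        "startIndex": 0,
        "pageSize": page_size,
        "unlimitedPaging": "false",
        "sortDirection": "asc",
        "sortColumn": "default",
        "shape": "RD",
        "country": "USA",
        "language": "en-us",
        "currency": "USD",
        "productSet": "BN",
    }
    first = requests.get(API_URL, params=params).json()
    # countRaw is the total row count reported by the first response.
    total = int(first["countRaw"])
    rows = list(first["results"])  # assumed key for the row data
    for start in page_offsets(total, page_size)[1:]:
        params["startIndex"] = start
        rows.extend(requests.get(API_URL, params=params).json()["results"])
    return rows
```

With `countRaw` at 100876 and a page size of 50, that works out to 2018 requests, so a polite `time.sleep` between them would be a good idea.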

Hope this helps.

Upvotes: 1
