Ryan Whitfield

Reputation: 27

Retrieving all information from a page with BeautifulSoup

I am trying to scrape the product URLs from an Old Navy page. However, I only get part of the product list instead of the whole thing (for example, only 8 URLs when the page shows far more than 8). Can someone help identify what the problem might be?

from bs4 import BeautifulSoup
from selenium import webdriver
import html5lib
import platform
import urllib
import urllib2
import json


link = "http://oldnavy.gap.com/browse/category.do?cid=1035712&sop=true"
base_url = "http://www.oldnavy.com"

driver = webdriver.PhantomJS()
driver.get(link)
html = driver.page_source
soup = BeautifulSoup(html, "html5lib")
bigDiv = soup.findAll("div", class_="sp_sm spacing_small")
for div in bigDiv:
  links = div.findAll("a")
  for i in links:
    productUrl = base_url + i["href"]
    print productUrl

Upvotes: 2

Views: 4195

Answers (1)

furas

Reputation: 142804

This page uses JavaScript to load its elements, but it loads more of them only when you scroll down the page.

This is called "lazy loading".

So you have to scroll the page as well, before you read its source.

from selenium import webdriver
from bs4 import BeautifulSoup
import time

link = "http://oldnavy.gap.com/browse/category.do?cid=1035712&sop=true"
base_url = "http://www.oldnavy.com"

driver = webdriver.PhantomJS()
driver.get(link)

# ---

# scrolling

lastHeight = driver.execute_script("return document.body.scrollHeight")
#print(lastHeight)

pause = 0.5
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(pause)
    newHeight = driver.execute_script("return document.body.scrollHeight")
    if newHeight == lastHeight:
        break
    lastHeight = newHeight
    #print(lastHeight)

# ---

html = driver.page_source
soup = BeautifulSoup(html, "html5lib")

#driver.find_element_by_class_name

divs = soup.find_all("div", class_="sp_sm spacing_small")
for div in divs:
    links = div.find_all("a")
    for a in links:
        # "a" instead of "link" so the page URL defined above is not overwritten
        print(base_url + a["href"])

Idea: https://stackoverflow.com/a/28928684/1832058
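
If the page keeps growing for a long time, the `while True` loop above may run for a while. Below is a minimal sketch of the same scrolling loop wrapped in a function with an upper limit on the number of scrolls. The name `scroll_to_bottom` and the parameter `max_rounds` are my own and not part of Selenium; the function only assumes that `driver` has an `execute_script()` method, as the PhantomJS driver above does.

```python
import time


def scroll_to_bottom(driver, pause=0.5, max_rounds=50):
    """Scroll until the page height stops growing or max_rounds is reached.

    Returns the number of scrolls after which the page got taller.
    """
    get_height = "return document.body.scrollHeight"
    last_height = driver.execute_script(get_height)
    grew = 0
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the lazy loader time to add more items
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            break  # nothing new was loaded, so this is probably the bottom
        last_height = new_height
        grew += 1
    return grew
```

You would call `scroll_to_bottom(driver)` right after `driver.get(link)` and before reading `driver.page_source`. The value of `pause` may need to be larger on a slow connection, otherwise the height can be checked before the new items arrive.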

Upvotes: 5
