V.Anh

Reputation: 535

Scraping a website with infinite scrolling using Selenium and BeautifulSoup returns repeated elements

I have a script which uses Selenium and BeautifulSoup to scrape this website: 'http://m.1688.com/page/offerlist.html?spm=a26g8.7664812.0.0.R19GYe&memberId=zhtiezhi&sortType=tradenumdown'

But my script keeps printing the first 8 elements of the page and ignores the content that appears when scrolling. This is the script:

# -*- coding: utf-8 -*-
from urllib import urlopen
from bs4 import BeautifulSoup as BS
import unicodecsv as ucsv
import re 
from selenium import webdriver
import time 

with open('list1.csv','wb') as f:
    w = ucsv.writer(f, encoding='utf-8-sig')

driver = webdriver.Chrome(r'C:\Users\V\Desktop\PY\web_scrape\chromedriver.exe')
base_url = 'http://m.1688.com/page/offerlist.html?spm=a26g8.7664812.0.0.R19GYe&memberId=zhtiezhi&sortType=tradenumdown'
driver.get(base_url)
pageSource = driver.page_source
lst = []
for n in range(10): 
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    soup = BS(pageSource, 'lxml')
    container = soup.find('div', {'class' : 'container'})
    items = container.findAll('div', {'class' : 'item-inner'})
    for item in items:
        title = item.find('div', {'class' : 'item-price'}).text
        title_ = ''.join(i for i in title if ord(i) < 128  if i != '\n')
        lst.append(title_)
    print lst
    time.sleep(5)

The output for each scroll is:

[u'0.21', u'0.45', u'1.10', u'3.60', u'2.20', u'6.80', u'1.40', u'3.00']
[u'0.21', u'0.45', u'1.10', u'3.60', u'2.20', u'6.80', u'1.40', u'3.00', u'0.21', u'0.45', u'1.10', u'3.60', u'2.20', u'6.80', u'1.40', u'3.00']
[u'0.21', u'0.45', u'1.10', u'3.60', u'2.20', u'6.80', u'1.40', u'3.00', u'0.21', u'0.45', u'1.10', u'3.60', u'2.20', u'6.80', u'1.40', u'3.00', u'0.21', u'0.45', u'1.10', u'3.60', u'2.20', u'6.80', u'1.40', u'3.00']

After the first scroll the list has 8 elements; after the second it has 16, with the extra 8 repeated from the first scroll. The same thing happens on every remaining scroll. So the script only ever returns the same 8 elements even though Selenium scrolls the page, whereas I want it to print all the elements that load while scrolling. I would really appreciate any advice.

Upvotes: 1

Views: 2277

Answers (3)

V.Anh

Reputation: 535

I found a solution: read driver.page_source inside the loop, after each scroll, instead of once before it. In addition, the Chrome window has to stay open rather than minimized to the taskbar; alternatively, you could use PhantomJS instead of ChromeDriver.

for n in range(10):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)
    # re-read the page source after each scroll so newly loaded items are included
    pageSource = driver.page_source
    soup = BS(pageSource, 'lxml')
    container = soup.find('div', {'class' : 'container'})
    items = container.findAll('div', {'class' : 'item-inner'})
    for item in items:
        title = item.find('div', {'class' : 'item-price'}).text
        title_ = ''.join(i for i in title if ord(i) < 128  if i != '\n')
        lst.append(title_)
    print len(lst)

Now the output will change, instead of

8
8
8
8

It will print

16
20
28
...

Upvotes: 0

Leandro Muto

Reputation: 66

There are two possibilities:

  1. let the infinite scroll finish and then get the data;
  2. after every content reload, you can compare the data that you already have with the new data and then add it to the list.
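
Both possibilities can be sketched roughly as follows. This is only an illustration, not code from the question: scroll_to_end and merge_new are hypothetical helper names, driver is assumed to be an already created Selenium WebDriver, and the key in each (key, value) pair is assumed to be something unique per item (for example its link), since two different items can have the same price.

```python
import time


def scroll_to_end(driver, pause=2.0, max_scrolls=50):
    """Possibility 1: scroll until the page height stops growing.

    Returns the number of scrolls performed. Afterwards page_source
    can be parsed once, so every item is seen exactly one time.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for count in range(1, max_scrolls + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load the next batch
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return count  # nothing new was loaded by this scroll
        last_height = new_height
    return max_scrolls  # gave up: the page kept growing


def merge_new(collected, seen_keys, batch):
    """Possibility 2: append only the values whose key was not seen before.

    batch is an iterable of (key, value) pairs; returns how many were added.
    """
    added = 0
    for key, value in batch:
        if key not in seen_keys:
            seen_keys.add(key)
            collected.append(value)
            added += 1
    return added
```

With the first helper, the BeautifulSoup parsing from the question would run once after scroll_to_end(driver) returns. With the second, it would run after every scroll and pass all items found so far to merge_new. Comparing on the price text alone would drop legitimate items that happen to share a price, which is why a separate key is assumed here.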

Upvotes: 0

Dmitriy Fialkovskiy

Reputation: 3225

The problem is in this part:

items = container.findAll('div', {'class' : 'item-inner'})
for item in items:
    title = item.find('div', {'class' : 'item-price'}).text
    title_ = ''.join(i for i in title if ord(i) < 128  if i != '\n')
    lst.append(title_)

Each time you "scroll", the items object becomes one block bigger, because the content loaded earlier doesn't go away when you scroll. To avoid duplication, you need to skip the items you have already processed (everything except the newest block) before appending to lst.
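
A minimal sketch of that idea, under the assumption that the page only appends items at the end and never removes or reorders them. collect_prices and parse_prices are hypothetical names: parse_prices stands for a function wrapping the BeautifulSoup code above that returns the price strings of all items currently in the page source.

```python
import time


def collect_prices(driver, parse_prices, scrolls=10, pause=2.0):
    """Scroll a fixed number of times and collect each item only once."""
    collected = []
    for _ in range(scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # wait for the next batch to load
        # Re-read the source: it contains everything loaded so far
        prices = parse_prices(driver.page_source)
        # The first len(collected) entries were handled in earlier passes,
        # so only the tail added since then is new
        collected.extend(prices[len(collected):])
    return collected
```

Because this skips by position rather than by value, two different items with the same price are both kept.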

Upvotes: 2
