Reputation: 535
So I have a script which uses Selenium and BeautifulSoup to scrape this website: 'http://m.1688.com/page/offerlist.html?spm=a26g8.7664812.0.0.R19GYe&memberId=zhtiezhi&sortType=tradenumdown'
But my script keeps printing only the first 8 elements of the page and disregards the content that appears when scrolling. This is the script:
# -*- coding: utf-8 -*-
from urllib import urlopen
from bs4 import BeautifulSoup as BS
import unicodecsv as ucsv
import re
from selenium import webdriver
import time

with open('list1.csv', 'wb') as f:
    w = ucsv.writer(f, encoding='utf-8-sig')
    driver = webdriver.Chrome('C:\Users\V\Desktop\PY\web_scrape\chromedriver.exe')
    base_url = 'http://m.1688.com/page/offerlist.html?spm=a26g8.7664812.0.0.R19GYe&memberId=zhtiezhi&sortType=tradenumdown'
    driver.get(base_url)
    pageSource = driver.page_source
    lst = []
    for n in range(10):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        soup = BS(pageSource, 'lxml')
        container = soup.find('div', {'class' : 'container'})
        items = container.findAll('div', {'class' : 'item-inner'})
        for item in items:
            title = item.find('div', {'class' : 'item-price'}).text
            title_ = ''.join(i for i in title if ord(i) < 128 if i != '\n')
            lst.append(title_)
        print lst
        time.sleep(5)
The output for each scroll is:
[u'0.21', u'0.45', u'1.10', u'3.60', u'2.20', u'6.80', u'1.40', u'3.00']
[u'0.21', u'0.45', u'1.10', u'3.60', u'2.20', u'6.80', u'1.40', u'3.00', u'0.21', u'0.45', u'1.10', u'3.60', u'2.20', u'6.80', u'1.40', u'3.00']
[u'0.21', u'0.45', u'1.10', u'3.60', u'2.20', u'6.80', u'1.40', u'3.00', u'0.21', u'0.45', u'1.10', u'3.60', u'2.20', u'6.80', u'1.40', u'3.00', u'0.21', u'0.45', u'1.10', u'3.60', u'2.20', u'6.80', u'1.40', u'3.00']
On the first scroll the list has 8 elements; on the second scroll it has 16 elements, but the extra 8 elements are just repeats of the first scroll. The same thing happens for the remaining scrolls. So the script only returns 8 elements even though I use Selenium to scroll the site, but I want it to print out all the elements as it scrolls. I would really appreciate it if you guys could give me some advice.
Upvotes: 1
Views: 2277
Reputation: 535
I have found an answer to the problem: move the pageSource assignment inside the loop, and instead of hiding Chrome in the taskbar you have to keep the window open, or you could use PhantomJS instead of the Chrome driver.
for n in range(10):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)
    # re-read the page source after every scroll so the newly loaded items are included
    pageSource = driver.page_source
    soup = BS(pageSource, 'lxml')
    container = soup.find('div', {'class' : 'container'})
    items = container.findAll('div', {'class' : 'item-inner'})
    for item in items:
        title = item.find('div', {'class' : 'item-price'}).text
        title_ = ''.join(i for i in title if ord(i) < 128 if i != '\n')
        lst.append(title_)
    print len(lst)
Now the output changes. Instead of
8
8
8
8
It will print
16
20
28
...
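If you would rather not rely on a fixed number of scrolls, one variation (a minimal sketch I have not tested against this page; the 2-second pause and the stopping rule are assumptions) is to keep scrolling until document.body.scrollHeight stops growing, and only then parse the page once:

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the lazily loaded items time to render
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        # the page height did not change, so no new items were loaded
        break
    last_height = new_height
pageSource = driver.page_source  # parse once, after all items are present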
Upvotes: 0
Reputation: 66
There are two possibilities:
Upvotes: 0
Reputation: 3225
The problem is in this part:
items = container.findAll('div', {'class' : 'item-inner'})
for item in items:
    title = item.find('div', {'class' : 'item-price'}).text
    title_ = ''.join(i for i in title if ord(i) < 128 if i != '\n')
    lst.append(title_)
Each time you "scroll", the items collection grows by one block, because the content above doesn't go away when you scroll. You need to drop the items from the first n-1 scrolls before processing items again to avoid the duplication.
Upvotes: 2
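A rough sketch of that idea (the prev_count variable is my own illustration, not part of the original script; it assumes the driver and lst from the question):

prev_count = 0
for n in range(10):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)
    soup = BS(driver.page_source, 'lxml')
    container = soup.find('div', {'class' : 'container'})
    items = container.findAll('div', {'class' : 'item-inner'})
    # only process the items that appeared since the previous scroll
    for item in items[prev_count:]:
        title = item.find('div', {'class' : 'item-price'}).text
        title_ = ''.join(i for i in title if ord(i) < 128 if i != '\n')
        lst.append(title_)
    prev_count = len(items)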