Reputation: 488
I'm trying to parse a Taobao page and get information about the goods (photo, text and link) with BeautifulSoup's find, but it doesn't find all of the classes.
import requests
from bs4 import BeautifulSoup

url = 'https://xuanniwen.world.tmall.com/category-1268767539.htm?search=y&catName=%BC%D0%BF%CB#bd&view_op=citations_histogram'

def get_html(url):
    # Fetch the raw HTML as the server sends it (no JavaScript is executed)
    r = requests.get(url)
    return r.text

html = get_html(url)
soup = BeautifulSoup(html, 'lxml')
z = soup.find("div", {"class": "J_TItems"})
z is empty (find returns None). But, for example:
z=soup.find("div",{"class":"skin-box-bd"})
len(z)
Out[196]: 3
works fine.
Why doesn't this approach work? What should I do to get all the information about the goods? I am using Python 2.7.
Upvotes: 2
Views: 2259
Reputation: 6518
So, it looks like the items you want to parse are being built dynamically by JavaScript, which is why soup.text.find("J_TItems") returns -1, i.e. there's no "J_TItems" at all in the text. What you can do is use selenium with a JS interpreter; for headless browsing you can use PhantomJS, like this:
from bs4 import BeautifulSoup
from selenium import webdriver
url='https://xuanniwen.world.tmall.com/category-1268767539.htm?search=y&catName=%BC%D0%BF%CB#bd&view_op=citations_histogram'
browser = webdriver.PhantomJS()
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'html5lib') # I'd also recommend using html5lib
JTitems = soup.find("div", attrs={"class":"J_TItems"})
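To check whether a page needs this browser-based approach at all, one option is to test the raw (unrendered) HTML for the class first. This is only a minimal sketch: class_in_static_html is a hypothetical helper, and static_html is a made-up stand-in for what requests.get(url).text might return, not the real Tmall response.

```python
from bs4 import BeautifulSoup

def class_in_static_html(html, class_name):
    """Return True if any tag in the unrendered HTML carries class_name."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find(class_=class_name) is not None

# Hypothetical stand-in for the raw response: the outer container exists,
# but the product list is only added later by a script.
static_html = """
<div class="skin-box-bd">
  <script>/* items are injected here at runtime */</script>
</div>
"""

found_container = class_in_static_html(static_html, "skin-box-bd")
found_items = class_in_static_html(static_html, "J_TItems")
print(found_container)
print(found_items)
```

If the class is missing from the raw HTML but visible in the browser's inspector, that is a strong hint that the content is rendered by JavaScript.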
Note that the items you want are inside each row defined by <div class="item4line1">, and there are 5 of them. You may only want the first three, because the other two are not really part of the main search results; filtering those out should not be difficult, a simple rows = rows[:3] will do the trick:
rows = JTitems.findAll("div", attrs={"class":"item4line1"})
>>> len(rows)
5
Now notice that each "Good" you mention in the question is inside a <dl class="item">, so you need to collect them all in a for loop:
Goods = []
for row in rows:
    for item in row.findAll("dl", attrs={"class": "item"}):
        Goods.append(item)
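The same row-then-item traversal can also be written as a single CSS selector with select. The sketch below runs against a small hypothetical snippet (the sample string is invented and much simpler than the real page), so treat it as an illustration of the technique rather than of Tmall's actual markup.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: two rows holding three products in total.
sample = """
<div class="J_TItems">
  <div class="item4line1">
    <dl class="item"><dd>first</dd></dl>
    <dl class="item"><dd>second</dd></dl>
  </div>
  <div class="item4line1">
    <dl class="item"><dd>third</dd></dl>
  </div>
</div>
"""

soup = BeautifulSoup(sample, "html.parser")
container = soup.find("div", attrs={"class": "J_TItems"})

# Equivalent to looping over the rows and then over each row's <dl class="item">.
goods = container.select("div.item4line1 dl.item")

names = [item.dd.get_text() for item in goods]
print(len(goods))
```

select returns the matches in document order, so the result lines up with what the nested loops would append.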
All that's left to do is to get the "photo, text and link" you mentioned, which can easily be done by accessing each item in the Goods list. By inspecting the markup you can work out how to get each piece of information; for example, for the picture URL a simple one-liner would be:
>>> Goods[0].find("dt", class_='photo').a.img["src"]
'//img.alicdn.com/bao/uploaded/i3/TB19Fl1SpXXXXbsaXXXXXXXXXXX_!!0-item_pic.jpg_180x180.jpg'
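For the other two fields, something along these lines might work. Only the dt.photo > a > img path is taken from the page above; the "detail" and "item-name" class names, the describe helper and the sample markup are assumptions for illustration, so check the real structure in your browser's inspector first.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for one product. Only dt.photo > a > img is known from
# the page discussed above; "detail" and "item-name" are assumed names.
sample = """
<dl class="item">
  <dt class="photo">
    <a href="//example.com/item?id=1"><img src="//example.com/pic.jpg"/></a>
  </dt>
  <dd class="detail">
    <a class="item-name" href="//example.com/item?id=1"> Sample jacket </a>
  </dd>
</dl>
"""

def describe(item):
    """Collect photo URL, link and title from one <dl class="item"> tag."""
    photo_link = item.find("dt", class_="photo").a
    name = item.find("a", class_="item-name")
    return {
        "photo": photo_link.img["src"],
        "link": photo_link["href"],
        # Fall back to None if the assumed title element is not present.
        "text": name.get_text(strip=True) if name else None,
    }

item = BeautifulSoup(sample, "html.parser").find("dl", class_="item")
info = describe(item)
print(info["photo"])
```

With the real page you would call describe on each element of Goods, adjusting the class names to whatever the rendered HTML actually uses.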
Upvotes: 2