BeautifulSoup doesn't show all elements

Question

I'm trying to parse a Taobao website and get information about the goods (photo, text and link) with BeautifulSoup's find, but it doesn't find all of the classes.

import requests
from bs4 import BeautifulSoup

url = 'https://xuanniwen.world.tmall.com/category-1268767539.htm?search=y&catName=%BC%D0%BF%CB#bd&view_op=citations_histogram'

def get_html(url):
    # Plain HTTP request: returns the HTML as served, before any JavaScript runs
    r = requests.get(url)
    return r.text

html = get_html(url)
soup = BeautifulSoup(html, 'lxml')
z = soup.find("div", {"class": "J_TItems"})

z is empty (find returns None). But, for example:

z=soup.find("div",{"class":"skin-box-bd"})
len(z)
Out[196]: 3

works fine

Why doesn't this approach work? What should I do to get all the information about the goods? I am using Python 2.7.

Vinícius Figueiredo · Accepted Answer

It looks like the items you want to parse are built dynamically by JavaScript, which is why soup.text.find("J_TItems") returns -1, i.e. there's no "J_TItems" at all in the text. What you can do is use selenium, which drives a real browser that executes the JavaScript. For headless browsing you can use PhantomJS, like this:

from bs4 import BeautifulSoup
from selenium import webdriver

url='https://xuanniwen.world.tmall.com/category-1268767539.htm?search=y&catName=%BC%D0%BF%CB#bd&view_op=citations_histogram'

browser = webdriver.PhantomJS()
browser.get(url)
html = browser.page_source

soup = BeautifulSoup(html, 'html5lib') # I'd also recommend using html5lib
JTitems = soup.find("div", attrs={"class":"J_TItems"})
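A side note for anyone on a current setup: PhantomJS is no longer maintained and recent Selenium releases no longer ship its driver, so the same idea would probably need another headless browser. Below is a rough sketch, assuming Chrome and a matching driver are installed. The helper names has_marker and fetch_rendered_html are my own, not part of any library. Since a class name lives in an attribute rather than in the visible text, the sketch checks the raw HTML string for the marker, which is a more direct test than searching soup.text.

```python
def has_marker(html, marker="J_TItems"):
    """Return True if the marker string occurs anywhere in the raw HTML."""
    return html.find(marker) != -1


def fetch_rendered_html(url):
    """Load url in headless Chrome and return the DOM after scripts have run.

    Assumes Chrome and a compatible driver are available on this machine.
    """
    from selenium import webdriver  # imported lazily; only needed for rendering

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    browser = webdriver.Chrome(options=options)
    try:
        browser.get(url)
        return browser.page_source
    finally:
        browser.quit()  # always release the browser process


# Made-up snippets standing in for the two kinds of response.
static_html = '<div class="skin-box-bd"></div>'
rendered_html = '<div class="skin-box-bd"><div class="J_TItems"></div></div>'

print(has_marker(static_html))    # False: requests alone would not be enough
print(has_marker(rendered_html))  # True: the container is present
```

If has_marker(html) is False for the text returned by requests, that is the cue to call fetch_rendered_html(url) instead. If the item list is filled in some time after the initial page load, an explicit wait may also be needed before reading page_source.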

Note that the items you want are inside rows, each defined by a <div class="item4line1">, and there are 5 of them. You may only want the first three, because the other two are not really part of the main search; filtering them out should not be difficult, a simple rows = rows[:3] does the trick:

>>> rows = JTitems.findAll("div", attrs={"class":"item4line1"})
>>> len(rows)
5

Now notice that each "good" you mention in the question is inside a <dl class="item">, so you need to collect them all in a for loop:

Goods = []    
for row in rows:
    for item in row.findAll("dl", attrs={"class":"item"}):
        Goods.append(item)
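As an aside, the row-then-item nesting can probably also be written as a single CSS selector with select(). Here is a minimal sketch on made-up markup: SAMPLE_HTML is hypothetical and far simpler than the real page, and html.parser is used only so the example needs no extra parser installed.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified stand-in for the rendered page.
SAMPLE_HTML = """
<div class="J_TItems">
  <div class="item4line1">
    <dl class="item"><dd>first</dd></dl>
    <dl class="item"><dd>second</dd></dl>
  </div>
  <div class="item4line1">
    <dl class="item"><dd>third</dd></dl>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Nested traversal: rows first, then the items inside each row.
rows = soup.find("div", class_="J_TItems").find_all("div", class_="item4line1")
goods_loop = [item for row in rows for item in row.find_all("dl", class_="item")]

# The same traversal expressed as one descendant selector.
goods_css = soup.select("div.J_TItems div.item4line1 dl.item")

print(len(goods_loop), len(goods_css))  # 3 3
```

The loop version is easier to adapt if you need to drop some rows first (for instance by slicing rows), while the selector is shorter when you want every item.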

All that's left is to get the "photo, text and link" you mentioned, which can easily be done by accessing each item in the Goods list. By inspecting the markup you can work out how to get each piece of information; for example, for the picture URL a simple one-liner would be:

>>> Goods[0].find("dt", class_='photo').a.img["src"]
'//img.alicdn.com/bao/uploaded/i3/TB19Fl1SpXXXXbsaXXXXXXXXXXX_!!0-item_pic.jpg_180x180.jpg'
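Note that the src above is protocol-relative (it starts with //), so it usually has to be joined with the page URL before it can be downloaded. The sketch below pulls all three fields from one item and normalizes the URLs with urljoin. The markup in SAMPLE_ITEM, the describe helper, and the idea that the product link is the a tag wrapping the photo are assumptions for illustration; check them against the real page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://xuanniwen.world.tmall.com/"

# Hypothetical markup, only loosely modelled on the structure described above.
SAMPLE_ITEM = """
<dl class="item">
  <dt class="photo">
    <a href="//detail.example.invalid/item.htm?id=1">
      <img src="//img.example.invalid/1.jpg_180x180.jpg">
    </a>
  </dt>
  <dd class="detail">Sample jacket</dd>
</dl>
"""


def describe(item, base_url=BASE_URL):
    """Return photo URL, link URL and visible text for one item tag."""
    link_tag = item.find("dt", class_="photo").a  # assumed: photo is wrapped in the link
    return {
        # urljoin fills in the scheme of base_url for '//host/path' references
        "photo": urljoin(base_url, link_tag.img["src"]),
        "link": urljoin(base_url, link_tag["href"]),
        # all visible text inside the item, whitespace collapsed
        "text": item.get_text(" ", strip=True),
    }


item = BeautifulSoup(SAMPLE_ITEM, "html.parser").find("dl", class_="item")
info = describe(item)
print(info["photo"])  # https://img.example.invalid/1.jpg_180x180.jpg
```

With the real page you would call describe on each element of Goods, e.g. [describe(g) for g in Goods], and adjust the lookups if the markup differs.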
