Taewoo.Lim

Reputation: 223

beautifulsoup python class parse

I want to get a list of products by parsing a website.

soup = bs(product_list_get.text, 'html.parser')
productlist = soup.find_all('td',{'class':'txtCode'})

Part of the result looks like this:

[<td class="txtCode"><a class="txtLink eProductDetail _product_code" href="/disp/admin/shop1/product/ProductRegister?product_no=42" product_no="42" target="_blank" title="새창 열림">P00000BQ</a></td>, <td class="txtCode"><a class="txtLink eProductDetail _product_code" href="/disp/admin/shop1/product/ProductRegister?product_no=41" product_no="41" target="_blank" title="새창 열림">P00000BP</a></td>

What I want is a list of the product_no values,

so the ideal result would be

[42,41]

I tried

productlist = soup.find_all('td',{'class':'txtCode'}).get('product_no')

but the result is

AttributeError: ResultSet object has no attribute 'get'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?

Can anyone kindly show me how to deal with this?

Upvotes: 0

Views: 139

Answers (2)

KC.

Reputation: 3107

product_no is contained in the href, so you can extract the href and then use a regex to match product_no.

from bs4 import BeautifulSoup
import re

lists = [
"""<td class="txtCode"><a class="txtLink eProductDetail _product_code" href="/disp/admin/shop1/product/ProductRegister?product_no=42" product_no="42" target="_blank" title="새창 열림">P00000BQ</a></td>""", 
"""<td class="txtCode"><a class="txtLink eProductDetail _product_code" href="/disp/admin/shop1/product/ProductRegister?product_no=41" product_no="41" target="_blank" title="새창 열림">P00000BP</a></td>"""]

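for fragment in lists:
    # Parse each HTML fragment on its own (the "lxml" parser requires the lxml package).
    soup = BeautifulSoup(fragment, "lxml")
    href = soup.a.get("href")
    # The lookbehind matches the word characters that follow "product_no=".
    product_no = re.search(r"(?<=product_no=)\w+", href).group(0)
    print(product_no)
# Output:
# 42
# 41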
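
As a variation on the regex above, the query string in the href can also be parsed with the standard library. This is only a minimal sketch: the html string is a shortened, hypothetical stand-in for the real page, and it assumes every href carries a product_no parameter.

```python
from urllib.parse import urlparse, parse_qs
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for the markup shown in the question.
html = """<table><tr>
<td class="txtCode"><a href="/disp/admin/shop1/product/ProductRegister?product_no=42" product_no="42">P00000BQ</a></td>
<td class="txtCode"><a href="/disp/admin/shop1/product/ProductRegister?product_no=41" product_no="41">P00000BP</a></td>
</tr></table>"""

soup = BeautifulSoup(html, "html.parser")

product_nos = []
for td in soup.find_all("td", {"class": "txtCode"}):
    href = td.a.get("href")
    # parse_qs maps each query parameter name to a list of its values.
    query = parse_qs(urlparse(href).query)
    product_nos.append(int(query["product_no"][0]))

print(product_nos)
```

Compared with the regex, this does not depend on where product_no sits in the URL, but it raises a KeyError if the parameter is missing.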

Upvotes: 1

qwermike

Reputation: 1486

The find_all method returns a ResultSet, which behaves like a list of Tag elements. So your code productlist = soup.find_all('td',{'class':'txtCode'}) returns a list of <td> elements, and the list itself has no get method. You want the product_no attribute of the inner <a> element for each <td> you found.

Iterate over productlist and read product_no from each element.

productlist = soup.find_all('td', {'class':'txtCode'})
product_nos = [int(p.find('a').get('product_no')) for p in productlist]

Alternatively, you can find the <a> elements that have a product_no attribute directly.

results = soup.find_all('a', {'product_no':True})
product_nos = [int(r.get('product_no')) for r in results]
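
The two approaches above can also be combined into a single CSS selector with select. This is a rough sketch rather than a tested solution for the real page: the html string is a shortened, hypothetical sample, and it assumes a Beautiful Soup version with CSS selector support (4.7 or later).

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical sample; the last cell has no product_no attribute.
html = """<table><tr>
<td class="txtCode"><a href="/p?product_no=42" product_no="42">P00000BQ</a></td>
<td class="txtCode"><a href="/p?product_no=41" product_no="41">P00000BP</a></td>
<td class="txtCode"><a href="#">no product number</a></td>
</tr></table>"""

soup = BeautifulSoup(html, "html.parser")

# Match <a> tags that have a product_no attribute and sit inside td.txtCode.
links = soup.select("td.txtCode a[product_no]")
product_nos = [int(a["product_no"]) for a in links]

print(product_nos)
```

Because the selector only matches links that carry the attribute, cells without product_no are skipped instead of causing an error in int().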

Upvotes: 1
