Reputation: 223
I want to get a list of products by parsing a website:
soup = bs(product_list_get.text, 'html.parser')
productlist = soup.find_all('td',{'class':'txtCode'})
Part of the result is as follows:
[<td class="txtCode"><a class="txtLink eProductDetail _product_code" href="/disp/admin/shop1/product/ProductRegister?product_no=42" product_no="42" target="_blank" title="새창 열림">P00000BQ</a></td>, <td class="txtCode"><a class="txtLink eProductDetail _product_code" href="/disp/admin/shop1/product/ProductRegister?product_no=41" product_no="41" target="_blank" title="새창 열림">P00000BP</a></td>
What I want is the list of product_no values, so the ideal result would be:
[42,41]
I tried:
productlist = soup.find_all('td',{'class':'txtCode'}).get('product_no')
but the result is:
AttributeError: ResultSet object has no attribute 'get'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
Can anyone kindly guide me on how to deal with this?
Upvotes: 0
Views: 139
Reputation: 3107
product_no is contained inside href, so you need to extract href. Then you can use regex to match product_no:
from bs4 import BeautifulSoup
import re
lists = [
"""<td class="txtCode"><a class="txtLink eProductDetail _product_code" href="/disp/admin/shop1/product/ProductRegister?product_no=42" product_no="42" target="_blank" title="새창 열림">P00000BQ</a></td>""",
"""<td class="txtCode"><a class="txtLink eProductDetail _product_code" href="/disp/admin/shop1/product/ProductRegister?product_no=41" product_no="41" target="_blank" title="새창 열림">P00000BP</a></td>"""]
for each in lists:
    soup = BeautifulSoup(each, "lxml")
    href = soup.a.get("href")
    product_no = re.search(r"(?<=product_no=)\w+", href).group(0)
    print(product_no)
# 42
# 41
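As a variation on the regex step, the query string of each href can also be parsed with the standard library's urllib.parse. This is only a minimal sketch: it assumes the hrefs have already been pulled out of the <a> tags, and the helper name product_no_from_href is made up for illustration.

```python
from urllib.parse import urlparse, parse_qs

# hrefs as they appear in the question's sample output
hrefs = [
    "/disp/admin/shop1/product/ProductRegister?product_no=42",
    "/disp/admin/shop1/product/ProductRegister?product_no=41",
]


def product_no_from_href(href):
    """Hypothetical helper: return product_no as an int, or None if it is absent."""
    query = urlparse(href).query              # e.g. "product_no=42"
    values = parse_qs(query).get("product_no")  # e.g. ["42"], or None if missing
    return int(values[0]) if values else None


product_nos = [product_no_from_href(h) for h in hrefs]
print(product_nos)
```

Compared with the lookbehind regex, this keeps working if other query parameters are added before or after product_no.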
Upvotes: 1
Reputation: 1486
The method find_all returns a list of Tag elements, so your code productlist = soup.find_all('td',{'class':'txtCode'}) returns a list of <td> elements, and that list itself has no get method. What you want is the product_no attribute of the inner <a> element of each <td> you found.
Iterate over productlist and read product_no from each inner <a>:
productlist = soup.find_all('td', {'class':'txtCode'})
product_nos = [int(p.find('a').get('product_no')) for p in productlist]
Alternatively, you can find the <a> elements that have a product_no attribute:
results = soup.find_all('a', {'product_no':True})
product_nos = [int(r.get('product_no')) for r in results]
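Both variants can be tried end to end with a small self-contained script. The sketch below is not taken from the original page: the HTML is a shortened copy of the sample in the question, the third <td> without a link is added only to show that cells with no matching <a> are skipped, and it uses a CSS selector via select() instead of find_all.

```python
from bs4 import BeautifulSoup

# Shortened sample markup; the last <td> is an illustrative addition.
html = """
<table><tr>
<td class="txtCode"><a class="txtLink" href="/disp/admin/shop1/product/ProductRegister?product_no=42" product_no="42">P00000BQ</a></td>
<td class="txtCode"><a class="txtLink" href="/disp/admin/shop1/product/ProductRegister?product_no=41" product_no="41">P00000BP</a></td>
<td class="txtCode">no link here</td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# Select <a> tags that carry a product_no attribute inside td.txtCode cells.
links = soup.select("td.txtCode a[product_no]")

# Attribute values are strings, so convert each one to int.
product_nos = [int(a["product_no"]) for a in links]
print(product_nos)
```

Selecting the <a> tags directly avoids calling find('a') on a <td> that has no link, which would return None and fail on .get().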
Upvotes: 1