Reputation: 88
I have a problem getting the href attribute of a tag. This is the HTML file:
<div class="list-product with-sidebar">
<a class="frame-item" href="./produk-a.html" target="_blank" title="Produk A">
</a>
<a class="frame-item" href="./produk-b.html" target="_blank" title="Produk B">
</a>
</div>
Here is my code:
def get_category_item_list(category):
    base_url = 'https://www.website.com/'
    res = session.get(base_url+category)
    res = BeautifulSoup(res.content, 'html.parser')
    all_title = res.findAll('a', attrs={'class':'frame-item'})
    data_titles = []
    for title in all_title:
        product_link = title.get('a')['href']
        data_titles.append(product_link)
    return data_titles
What I want to get is the href links, like this:
produk-a.html
produk-b.html
When I run it, I can't get the href links. It raises this error:
TypeError: 'NoneType' object is not subscriptable
Upvotes: 5
Views: 11480
Reputation: 541
You didn't share the website, so one possible problem is that the website blocks User-Agents that look like a bot (such as the default User-Agent of requests). Debugging may help here: you can print the content of the page with resp.content or resp.text.
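As a minimal sketch of that check, assuming the site really does filter on the User-Agent header (which I can't confirm without the URL), a session can send a browser-like value instead of the requests default. The User-Agent string and the helper names make_session and fetch_html are made up for illustration.

```python
import requests

# Hypothetical browser-like User-Agent string; the exact value is an assumption.
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def make_session(user_agent=BROWSER_UA):
    """Return a requests session that sends a custom User-Agent header."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


def fetch_html(session, url):
    """Fetch a page and print enough of the response to debug it."""
    resp = session.get(url, timeout=10)
    print(resp.status_code)   # a 403 here would hint at bot blocking
    print(resp.text[:500])    # check that the expected markup is present
    return resp.text


session = make_session()
print(requests.utils.default_user_agent())  # what requests sends by default
print(session.headers["User-Agent"])        # what this session sends instead
```

fetch_html is not called here because it needs the real URL. If the printed HTML does not contain the frame-item links at all, the problem is in the request rather than in the parsing.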
I created an HTML file called index.html, then read the file and scraped its content. I changed the code a little and it seems to work fine.
soup.find returns a <class 'bs4.element.Tag'>, so you can access its attributes by name, for example a['href'].
from bs4 import BeautifulSoup

with open('index.html') as f:
    html_content = f.read()

soup = BeautifulSoup(html_content, 'html.parser')
data_titles = []
for a in soup.find('div', class_='list-product with-sidebar').find_all('a'):
    data_titles.append(a['href'].split('/')[1])
print(data_titles)
# ['produk-a.html', 'produk-b.html']
index.html:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Document</title>
</head>
<body>
<div class="list-product with-sidebar">
<a
class="frame-item"
href="./produk-a.html"
target="_blank"
title="Produk A"
>
</a>
<a
class="frame-item"
href="./produk-b.html"
target="_blank"
title="Produk B"
>
</a>
</div>
</body>
</html>
Upvotes: 2
Reputation: 17408
For your exact output:
from bs4 import BeautifulSoup
html = """<div class="list-product with-sidebar">
<a class="frame-item" href="./produk-a.html" target="_blank" title="Produk A">
</a>
<a class="frame-item" href="./produk-b.html" target="_blank" title="Produk B">
</a>
</div>"""
res = BeautifulSoup(html, 'html.parser')
for a in res.findAll('a', attrs={'class':'frame-item'}):
    print(a["href"].split("/")[-1])
Output:
produk-a.html
produk-b.html
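Taking the last segment with [-1] also works when the href has more than one directory level, which an index of [1] would not. As a hedged alternative sketch using only the standard library, the filename can be taken from the parsed path, which also drops any query string. The helper name filename_from_href and the third href are hypothetical and not from the question.

```python
from posixpath import basename
from urllib.parse import urlparse


def filename_from_href(href):
    """Return the last path segment of an href, ignoring query and fragment."""
    # urlparse separates the path from any ?query or #fragment part;
    # basename then keeps only what follows the final slash.
    return basename(urlparse(href).path)


hrefs = [
    "./produk-a.html",
    "./produk-b.html",
    "/shop/items/produk-c.html?ref=home",  # hypothetical nested link
]
filenames = [filename_from_href(h) for h in hrefs]
print(filenames)
```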
Upvotes: 2
Reputation: 11942
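I believe that your problem lies in this line:
product_link = title.get('a')['href']
Each title is already an "a" element, and title.get('a') looks up an attribute named a on it. No such attribute exists, so it returns None, and subscripting None with ['href'] raises the TypeError. You probably just need: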
product_link = title['href']
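Applied to the HTML from the question, that fix could look like the sketch below. It parses a hard-coded string instead of fetching the page, and the function name extract_links is my own, not from your code.

```python
from bs4 import BeautifulSoup

# HTML snippet copied from the question.
HTML = """<div class="list-product with-sidebar">
<a class="frame-item" href="./produk-a.html" target="_blank" title="Produk A">
</a>
<a class="frame-item" href="./produk-b.html" target="_blank" title="Produk B">
</a>
</div>"""


def extract_links(html):
    """Collect the href of every <a class="frame-item"> in the given HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    links = []
    for tag in soup.find_all('a', class_='frame-item'):
        # tag.get('href') returns None instead of raising if href is missing.
        href = tag.get('href')
        if href is not None:
            links.append(href)
    return links


links = extract_links(HTML)
print(links)
# ['./produk-a.html', './produk-b.html']
```

Using tag.get('href') rather than tag['href'] is a design choice: it skips anchors without an href instead of raising a KeyError.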
Upvotes: 6