Reputation: 43
I want to get the href values from a = soup.find_all('div', class_='email-messages'), which returns:
[<div class="email-messages">
<table>
<tr>
<td id="email-title">Message Title</td>
<td id="email-sender">Sender</td>
<td id="email-control">Control </td>
</tr>
<tr>
<td><a href="/en/msg/3EEB344D-505C-8CE7-09C5-2DD54F1AECD1">Fwd: [Microsoft Academic Verification] Confirming Your Academic Status</a></td>
<td id="email-sender"><span data-cf-modified-c9b86b506f187bfdc48368eb-="" onclick="if (!window.__cfRLUnblockHandlers) return false; show_sender_email(this, '[email protected]')" style="cursor: pointer;">Tuấn Anh Vũ</span></td>
<td id="email-control"><a data-cf-modified-c9b86b506f187bfdc48368eb-="" href="/en/msg/3EEB344D-505C-8CE7-09C5-2DD54F1AECD1/delete" onclick="if (!window.__cfRLUnblockHandlers) return false; return delete_mail('/msg/3EEB344D-505C-8CE7-09C5-2DD54F1AECD1/delete');">[Delete]</a></td>
</tr>
<tr>
<td class="mail_message_counter" colspan="3">Total Messages: <strong>1</strong></td>
</tr>
</table>
</div>]
My code:
soup = BeautifulSoup(html_doc, 'lxml')
a = soup.find_all('div', class_='email-messages')
for link in a:
    print(link['href'])
I got this error:
in __getitem__
return self.attrs[key]
KeyError: 'href'
Upvotes: 1
Views: 192
Reputation: 4965
For "single-purpose" scraping it is quite useful to customize the parser with SoupStrainer. It is faster (or it should be!) since it parses only the desired portion of the document. Details are in the "Parsing only part of a document" section of the Beautiful Soup documentation.
The SoupStrainer instance must always be passed to the BeautifulSoup constructor as the parse_only keyword argument:
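from bs4 import BeautifulSoup, SoupStrainer

html_doc = """..."""  # placeholder: the HTML from the question above

# Parse only the <a> tags that carry an href attribute
soup = BeautifulSoup(html_doc, 'lxml', parse_only=SoupStrainer('a', href=True))
for tag in soup:
    print(tag['href'])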
Output
/en/msg/3EEB344D-505C-8CE7-09C5-2DD54F1AECD1
/en/msg/3EEB344D-505C-8CE7-09C5-2DD54F1AECD1/delete
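Because SoupStrainer accepts the same kind of arguments as find_all, the same idea can probably be narrowed to the question's container div, so that links elsewhere on the page are never parsed. This is only a sketch: the markup is a shortened stand-in for the question's HTML, the hrefs and the only_messages name are made up for illustration, and html.parser is used just to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Shortened, hypothetical stand-in for the question's markup.
html_doc = """
<a href="/outside">link outside the container</a>
<div class="email-messages">
  <table>
    <tr>
      <td><a href="/en/msg/1">Message</a></td>
      <td><a href="/en/msg/1/delete">[Delete]</a></td>
    </tr>
  </table>
</div>
"""

# Keep only <div class="email-messages"> elements (with their contents) while parsing.
only_messages = SoupStrainer("div", class_="email-messages")
soup = BeautifulSoup(html_doc, "html.parser", parse_only=only_messages)

# The <a> tags are now nested inside the kept <div>, so search for them explicitly.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]
print(hrefs)
```

Here the <a> tags are no longer top-level children of the soup, so find_all is used instead of iterating over the soup directly.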
Remember:
Iterating with "for tag in soup" walks the soup object itself, not a list, so the loop variable is a bs4.element.Tag object.
SoupStrainer has the same signature as the find_all method.
Upvotes: 2
Reputation: 195408
You're trying to get "href" from the <div> tag. Try to find all <a> tags inside the <div>s:
from bs4 import BeautifulSoup
html_doc = """<div class="email-messages">
<table>
<tr>
<td id="email-title">Message Title</td>
<td id="email-sender">Sender</td>
<td id="email-control">Control </td>
</tr>
<tr>
<td><a href="/en/msg/3EEB344D-505C-8CE7-09C5-2DD54F1AECD1">Fwd: [Microsoft Academic Verification] Confirming Your Academic Status</a></td>
<td id="email-sender"><span data-cf-modified-c9b86b506f187bfdc48368eb-="" onclick="if (!window.__cfRLUnblockHandlers) return false; show_sender_email(this, '[email protected]')" style="cursor: pointer;">Tuấn Anh Vũ</span></td>
<td id="email-control"><a data-cf-modified-c9b86b506f187bfdc48368eb-="" href="/en/msg/3EEB344D-505C-8CE7-09C5-2DD54F1AECD1/delete" onclick="if (!window.__cfRLUnblockHandlers) return false; return delete_mail('/msg/3EEB344D-505C-8CE7-09C5-2DD54F1AECD1/delete');">[Delete]</a></td>
</tr>
<tr>
<td class="mail_message_counter" colspan="3">Total Messages: <strong>1</strong></td>
</tr>
</table>
</div>"""
soup = BeautifulSoup(html_doc, "html.parser")
divs = soup.find_all("div", class_="email-messages")
for div in divs:
    for link in div.find_all("a"):
        print(link["href"])
Prints:
/en/msg/3EEB344D-505C-8CE7-09C5-2DD54F1AECD1
/en/msg/3EEB344D-505C-8CE7-09C5-2DD54F1AECD1/delete
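As a possible alternative to the nested loops, a CSS selector can express "every <a> with an href inside div.email-messages" in a single call to select. This is a minimal sketch, assuming a shortened stand-in for the question's HTML; the hrefs are made up for illustration.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for the question's markup.
html_doc = """
<div class="email-messages">
  <table>
    <tr>
      <td><a href="/en/msg/1">Message</a></td>
      <td><a href="/en/msg/1/delete">[Delete]</a></td>
    </tr>
    <tr>
      <td><a>anchor without href</a></td>
    </tr>
  </table>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# "a[href]" matches only anchors that actually have an href attribute,
# so indexing with ["href"] cannot raise KeyError here.
hrefs = [a["href"] for a in soup.select("div.email-messages a[href]")]
print(hrefs)
```

The [href] part of the selector plays the same role as href=True in find_all: anchors without the attribute are skipped rather than raising KeyError.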
Upvotes: 1